


"^^^ 



':L^y.:i 



Aiy.vi. ■'• ■ 'a'. 



^ ^ "^^J^^* ^^ 



^. %-t;' ,^^ 












'""■%, %./ .'^15%% ^^^^^ .^^ft\ \^^^- 





V 



c^^ ; 












.'N 





.•^°x- 




.^'>. 










o V 



^""^ 















^. O c ° " ° -» c) 




MODERN EDUCATION SERIES 






.:i 



INTRODUCTION 

TO THE USE OF 

STANDARDIZED 
TESTS 



DENTON L. GEYER, Ph. D. 



r 



4 



*3he PLYMOUTH PRESS ^ 
CHICAGO 



m , 



'^..r 



MODERN EDUCATION SERIES 

Editid by JAMES E. McDADE 



INTRODUCTION 

TO THE USE OF 

STANDARDIZED 
TESTS 



By 
DENTON L. GEYER, Ph. D. 

Department of Education 
Chicago Normal College 



THE PLYMOUTH PRESS 
CHICAGO 



\\ 



1)\ 






COPYRIGHT. 1922 
THE PLYMOUTH PRESS 



5C1A702:544 



"^.^ 



CONTENTS 

Chapter Page 

I. THE FUNCTIONS OF STANDARDIZED 

TESTS 5 

II. THE DIFFERENT KINDS OF TESTS 17 

III. TESTS OF ABILITY TO LEARN 17 

The Meaning of General Intelligence 17 

Composition of the Tests 18 

Validity of Intelligence Tests 22 

Principal Uses of Intelligence Tests 28 

Selection of an Intelligence Test 34 

Giving the Tests 38 

IV. TESTS OF AMOUNT LEARNED 42 

General Statement 42 

The Tests in Arithmetic 45 

The Tests in Reading 52 

The Tests in Spelling 60 

Tests in Punctuation and Grammar 62 

The Composition Scales 62 

The Handwriting Scales 63 

Other Achievement Tests 66 

"Home-Made" Objective Tests 66 

What to Do 72 

Selecting an Achievement Test 72 

V. PUTTING MEANING INTO SCORES 75 

Tables 75 

The Median 77 

The Quartile Deviation 81 

The Mean Deviation 82 

Th^ Standard Deviation 82 

The Measurement of Relationship 82 

Summary of Chapter 95 



FOREWORD 

This little outline is written to give the classroom 
teacher a brief general survey of the measuring move- 
ment in education. It attempts not so much to bring out 
things that are new as to set forth a certain number of 
the more salient facts about its subject in simple and non- 
technical language. It therefore touches on a consider- 
able range of topics: how the movement has developed 
and what it implies ; what goes into standardized tests and 
what they are used for; how to choose the best from 
among them; and how to interpret their scores. 

The four or five excellent treatises we have on stand- 
ardized tests are longer and cover somewhat different 
ground; some limit their material for the most part to 
descriptions of the existing tests and directions for admin- 
istering them; others give some hint of the meaning of the 
movement as a whole, but introduce such material only 
incidentally and sketchily; others give their principal em- 
phasis to teaching-devices for dealing with the defects 
which standarized tests may reveal; others cover a wider 
range of topics, but couch their ideas in highly technical 
terminology. Few have given full attention to both intelli- 
gence tests and achievement tests within the same covers. 
All use for their explanation many more pages than the 
average teacher feels that she could take time to make her 
way through. 

The present brief booklet will fulfill its purpose if it 
gives the classroom teacher without previous experience 
in this field and without special psychological training a 
short statement of the facts about educational measure- 
ments which is at the same time readable and practically 
useful. 



CHAPTER I 

THE FUNCTIONS OF STANDARDIZED TESTS 

We may think of a test as standardized when it has 
been given under uniform conditions to a sufficiently 
large number of children to allow us to base on these 
scores standards of attainment for other children of the 
same age or school grade. The test is thus standardized 
in two senses — in the sense that the method of giving and 
scoring it is so definitely controlled as to be standardized, 
and in the sense that the scores already secured serve as 
objective standards for other classes. 

The advantages of such a test are obvious. By it the 
teacher can compare her class with other classes in other 
schools. She can find out whether her pupils are as intelli- 
gent, or as proficient in their studies, as the children of 
other cities or of the country in general. She can find out 
in which studies they are up to standard and in which 
below standard, and can distribute her teaching emphasis 
accordingly. She can discover whether her pupils have 
an average amount of native ability, or are decidedly 
inferior or superior. This helps her to decide how rapidly 
to proceed in her work and how much time to give to drills 
and reviews, as well as letting her know whether the re- 
sults she is getting are of the kind that should be expected. 
By an objective test the teacher can also discover 
whether a certain method of teaching is proving effective. 
She can learn this by giving at the beginning of the ex- 
periment a test of intelligence and a test of achievement in 
the given study, and another test of achievement in the 
study at the end of the experiment. The intelligence test 
will show whether the pupils are of a type to make slow, 
average, or rapid progress. The difference between the 



STANDARDIZED TESTS 

scores on the first and second achievement tests will show 
how much progress, with her help, they have actually 
made in this study. If it is less than it should be, other 
teaching methods may be tried; if it is more than most 
children of this ability make, the teaching-method may be 
retained and tried again, and if again vindicated may 
finally be accepted with complete confidence as a good 
scheme for teaching that subject. This experiment cannot 
be carried on by means of the old-fashioned tests, because 
the teacher can never be sure that the two tests used at the 
beginning and end of the experiment are of exactly equal 
difficulty, or that she has graded both tests of papers with 
exactly the same degree of strictness. Standardized tests 
are so constructed that answers are either right or wrong: 
they will therefore be scored in just the same way at all 
times. And the two tests of such a pair as would be used 
at the beginning and end of this experiment are known to 
be of equal difficulty because they have been tried out with 
hundreds of children before being published. 

To be able to see whether pupils are up to standard in 
each of their studies, to be able to determine their bright- 
ness as compared with other children, and to be able to try 
out teaching plans by the method of scientific experiment, 
are tremendous advantages. In these ways the teacher can 
learn what kind of material she has to work with, how 
well it has so far been worked with by herself and her 
predecessors, and to what extent her pet schemes of teach- 
ing are bringing results. These three uses of standardized 
tests, if actually in effect, would completely transform the 
average school. They are, however, by no means the only 
uses to which such tests can be put. 

Standardized tests assist one in learning not only how 
rapidly the pupils are progressing, but why certain of them 
do not make progress as rapidly as they should. In other 
words, standardized tests can be used for diagnosis of 

6 



THE FUNCTIONS OF TESTS 

mental difficulties, just as a clinical thermometer can be 
used by a physician for diagnosis of physiological difficul- 
ties. A test in addition, let us say, is devised in such a 
way that the half-dozen abilities which go to make up 
what we call the ability to add are tested separately. Thus 
we may have distinct tests for knowledge of the table, add- 
ing to a partial sum, bridging the tens, carrying from 
column to column, etc. By diagnostic tests of this sort the 
teacher is assisted in learning why a child cannot add, or 
subtract, or read, or write, and discovers exactly where he 
most needs help. Most of our instruction is of the hit- 
and-miss type, hitting the pupil's difficulty just occasionally 
and by happy accident. In medicine only the quack doctor 
gives medicine without finding the cause of the illness. In 
education, we nearly all do it. But by the help of diag- 
nostic tests the teacher now has an opportunity to base her 
instruction on exact knowledge of each pupil's difficulties. 

Standardized tests also show the pupil where he stands 
with reference to other pupils. He can compare his score 
either with that of other pupils in his room or with the 
average in America. The effect is excellent. He comes 
to feel that he is no longer working to meet the half- 
understood and half-accepted standard of the teacher, but 
is working to do as well as other pupils of his age or 
grade. This to him is definite and reasonable. He will 
spend hours in practicing with a football or baseball to 
be as proficient as other boys of his age. Toward fixed 
and intelligible standards in his studies his attitude is 
much the same. To be as good in a thing as the "other 
kids," especially if that thing is something which can be 
mastered by assiduous practice, is a motive that will arouse 
to strenuous exertion many a child who is now kiUing 
time. 

Psychologists keep telling us that, given a pupil of a 
certain degree of native ability, the principal factor deter- 
mining his rate of learning will be his resolution to learn, 



STANDARDIZED TESTS 

his purpose to learn. Good teaching then requires, as its 
first factor, abihty to arouse and maintain the purpose 
to learn. One of the best means of keeping fresh the 
purpose to learn is furnishing the pupil with definite 
objectives. The scores on standardized tests supply to the 
pupil's goal just this definiteness. When such scores are 
represented by a simple graph, say with one line showing 
the given pupil's attainment in these tests and another line 
the attainment of the average American child of this age 
or grade who has taken these tests, then the pupil has his 
strong and weak points set before him in a manner that 
is perfectly definite and objective. 

The teacher will also find that by means of standard- 
ized tests she can very greatly increase the accuracy of 
her rating of the achievement of pupils. Our present 
methods of measuring the achievement of children in 
their studies are most regrettably defective. When teachers 
are asked to grade their papers a second time they some- 
times miss their first mark by as much as fifteen or 
twenty per cent. A considerable number of teachers who 
tried this experiment in freshman classes at the University 
of Wisconsin found these extremes, and found that in 
general the amount of difiference between the first and 
second marks averaged about five and a half per cent. 
Yet pupils glory in beating a rival by one or two per 
cent, and teachers debate with themselves whether to give 
a paper 88% or 89% — neither realizing that if the paper 
were graded again in a few days its mark would probably 
dififer from this by some five per cent. When several 
teachers grade the same paper, the variations are still 
wider. A facsimile of a student's paper in an examination 
in geometry, graded by the principal teacher of mathemat- 
ics in each of 110 accredited high schools in the Middle 
West, received marks ranging from 28% to 92% ; and 
all the way from about 45% up to about 85% (except just 
at the passing mark) the marks were spread quite evenly. 



THE FUNCTIONS OF TESTS 

The distribution of high and low marks within a class 
is equally erratic. Certain teachers give from three to ten 
times as many high marks as others in the same school. 
In a study of the office records for a considerable number 
of years in one institution, ''A's" were actually found to 
be thirty-five times as common in one subject as in 
another, and "failures" to vary from an average of 33% 
in one subject to zero in another. Differences of this kind 
are now known to be the rule and not the exception.* 

Are such marks likely to serve long in arousing pupils 
to effort? It is only too evident to pupils and students 
that marks do not depend upon achievement. Is it any 
wonder, then, that the relation between teacher and stu- 
dent, in the upper schools, is so commonly shot through 
with hypocrisy, and that it is so extremely difficult to 
establish normal social intercourse in the class-room? The 
students are students of the marking basis of their new 
teacher — of his or her whims and foibles and hobbies 
and pet aversions, and they ''strive to please." Never 
yet has the discovery of truth proceeded successfully in 
any such atmosphere. And never will the desire for 
achievement appeal to those in the class most needing 
this motive until achievement is more fairly and more 
accurately measured. 

Standardized tests do this. Standardized tests, because 
their scoring plan is standardized, are independent of the 
peculiarities of the person scoring them. All answers in 
the best of these tests are either right or wrong, and the 
score is the same whether the number of correct answers 
is counted up by one teacher or another, or by a clerk. 
Ir this way the pupil is credited with exactly the amount 
accomplished. When this amount is compared with the 
amount accomplished by pupils elsewhere, or by other 
pupils of the same ability and training in this school, 

* On the inaccuracy of school marks, see Monroe, Measuring the Results 
of Teaching, Chapter 1 ; or Finkelstein, Teachers' Marks; or Rugg, Teachers' 
Marks, in Educational Administration and Supervision, Vol. 1. 

9 



STANDARDIZED TESTS 

marks or grades can be given accurately and dependably. 
Whether a pupil is ready for the next grade or the next 
course in the subject is then really known, not left to 
subjective estimate. And if tests of general intelligence 
are used in connection with tests of school achievement, 
pupils may be graded not always on a comparison of their 
achievement with that of other pupils of their age, but 
sometimes on a comparison of their achievement with 
that of other pupils of their degree of native ability. 
When marks are given according to the ratio of achieve- 
ment to ability, dull pupils will no longer be subjected to 
proddings and ridicule in school and to floggings at home, 
all for failure to do work which they are really unable to 
do; nor will bright pupils need to be allowed to idle away 
their time in school and acquire habits of indolence which 
will handicap them throughout life: all children may be 
asked simply to keep their achievement up to that of the 
average child of that capacity. 

By means of standardized tests, teachers as well as 
pupils may be rated more justly. A teacher's rating for 
efficiency ought to depend upon the progress of her pupils, 
due account being taken of the pupils' native ability. Given 
a class of a certain average ability, the teacher shows her 
skill by the difference in scores made on standardized tests 
by the pupils at the beginning and at the end of their 
stay with her. When lawyers and carpenters and engi- 
neers are judged on the basis of results secured, should a 
teacher continue to be judged on the basis of her ''pres- 
ence," or her voice, or her handwriting, or her use of 
methods favored by her superior? Is it not infinitely 
preferable that efficiency be proved by an exhibition of 
results secured? By standardized tests results are made 
tangible and measurable and are removed from the realm 
of debate. Measured results will make possible a genuine 
merit system of teacher promotion, with all the stimula- 
tion to eflfective lesson planning which that implies. It is 

10 



t 



THE FUNCTIONS OF TESTS 

doubtful whether we have well-standardized tests in a 
sufficient variety of subjects to make this plan prac- 
ticable at present. But when it comes it will certainly be 
the most effective single influence for the improvement of 
skill in teaching. 

A teacher taking charge of a new room will find stand- 
ardized tests useful in informing her of the educational 
status of her new pupils. By knowing this she can avoid 
re-teaching some things that the pupils already know, and 
teaching others for which they are not yet prepared. 
Exactly how much a class knows about a given topic as 
compared with the standard for that grade can be learned 
by using tests which have been given to thousands of 
other children in this grade, and which have scoring 
devices that make the scores independent of one's inevit- 
ably varying conceptions of scholarship. 

The principal will find standardized tests valuable in 
correctly placing pupils who have been transferred from 
other schools. Very commonly such pupils are put back a 
grade for safety's sake. This injustice to the child can be 
avoided by using tests carrying with them a standard for 
each subject in each grade. 

School officials also find advantage in using standard- 
ized-test scores as a medium for informing the public of 
the progress and the needs of the schools. Very frequent- 
ly the public cannot understand what the schools are try- 
ing to do and the schools cannot tell them. Test scores 
furnish the common language, for anyone can under- 
stand what is meant by saying that our schools in Smith- 
ville are a year ahead of most schools of America in, say, 
arithmetic, and a year or two years behind others in music 
or French or manual training. The need of expansion of 
equipment for certain lines of work then becomes plain. 
School officials have also used standardized tests to refute 
unwarranted attacks on a certain school system by those 
who were politically interested in securing a change of 

11 



STANDARDIZED TESTS 

administration. And, of course, modern school "surveys" 
draw no conclusions about a school system without found- 
ing opinion on measured results. 

Foreign countries regard the measuring movement as 
the most distinctive and significant feature of American 
education. Emissaries are sent to America to study it. 
Other countries are beginning to utilize the methods 
worked out here, and standardized tests are now in lim- 
ited use in Western Europe, India, China, Hawaii, and 
Australia. 

We ought to remind ourselves that the use of scientific 
method — the production of work that is precise, objective, 
impartial, and verifiable — has in its apphcation brought all 
the comforts and conveniences of modern civilization. 
Scientific method has, for example, completely transformed 
agriculture in one century, in this time making greater 
changes in that industry than were effected in the preced- 
ing fifty centuries. Through its use of scientific method 
education seems likely to undergo the same transforma- 
tion. Present school plans may in a short time seem as 
antiquated as the use of the sickle and flail in farming 
do now. 

It behooves the teacher to become familiar with stand- 
ardized educational measurements, both because of their 
probable future influence and because of the direct assist- 
ance they can render now in solving everyday classroom 
problems. 



12 



t 



^ 



CHAPTER II 

THE DIFFERENT KINDS OF TESTS 

The principal kinds of standardized tests may be shown 
in outhne form thus : 

I. Intelligence or mental-alertness tests. 

1. Individual tests. 

2. Group tests. 

(a) Tests of abstract intelligence. 

(b) Tests of mechanical intelligence. 

(c) Tests of social intelligence. 

II. Achievement tests. 

1. Research tests. 

(a) Primarily for comparison. 

(b) Primarily for diagnosis. 

(c) Primarily for prognosis. 

2. Practice tests. 

3. Teacher-made objective tests. 

III. Intelligence-achievement synthetic scales. 

IV. Miscellaneous. 

1. Scales measuring will-temperament. 

2. Scales measuring growth in religion. 

3. Scales measuring habits of good citizen- 
ship, etc. 

I. Intelligence tests. The intelligence or mental- 
alertness tests are tests of native mental ability, or at 
least of such ability apart from the direct influence of 
schooling. They are presumed to show, in their common- 

13 



STANDARDIZED TESTS 

est form, the ability of a child to profit from his schooling. 
Individual intelligence tests are tests which much be given 
to one child at a time. The best-known example is the 
Binet-Simon test, which, in various revisions, is used in 
most large-city school systems for the discovery of "sub- 
normal" or feebleminded children. Group intelligence 
tests are tests which can be given to a whole group at 
one time. The best-known of these is the Army Alpha 
test, devised during the war for the classification of the 
American recruits ; but we now have group tests for almost 
all ages of children and for all degrees of intelligence in 
adults. 

Tests of abstract intelligence are tests which measure 
one's mental ability by measuring one's ability to deal with 
abstract symbols, such as letters or words or numbers. 
The reality itself, the actual object, is not present — only 
the symbol representing it, such as its name or some letter 
or number to stand for it. For those not ready in dealing 
with abstractions, we are now beginning to have tests built 
from other sorts of material. Tests of mechanical intelli- 
gence are tests of one's ability to deal with machines. The 
test usually consists of a number of pieces of machinery 
taken apart, the requirement being to put them together 
again. The simplest bit of machinery may be nothing 
more than a nut and a bolt, the most intricate ones some- 
thing as complex as the parts of a clock; and there are 
all grades of difficulty between. 

Besides tests of abstract intelligence, or the ability to 
deal successfully with symbols, and tests of mechanical in- 
telligence, or the ability to deal successfully with machines, 
there probably ought to be tests of social intelligence, or 
the ability to deal successfully with people. There are 
individuals who, it seems, learn but little from books, and 
who have no special aptitude for machinery, who can suc- 
ceed as salesmen or business executives or in similar work 
whose primary demand is the ability to understand and 

14 



THE KINDS OF TESTS 

direct other people. These tests, however, are only in 
their first stages. 

II. Achievement tests. The achievement tests 
measure, not the ability to learn, but the amount that has 
been learned. They are concerned with the pupil's pro- 
ficiency in his school studies. Practically all of these are 
group tests. The research tests are those whose primary 
function is measurement, while the practice tests are those 
which measure results only incidentally and whose primary 
function is the improvement of results ; that is, the practice 
tests are in their first intention teaching devices. Practice 
tests in arithmetic, for example, are skillfully graded out- 
lines of "seat work," allowing each pupil to go forward 
at his own rate in mastering the four basic operations, 
and permitting him to prove his mastery of them by pass- 
ing increasingly difficult tests. Of the research tests, some 
are designed primarily for the comparison of one school 
with another or with a standard of attainment, or for the 
comparison of the proficiency of pupils at one date with 
their proficiency at an earlier or a later date. Others are 
designed primarly for diagnosis, that is, for finding the 
reason that certain pupils do not learn. They test the 
pupil's ability in each part of the operation or topic sep- 
arately, and attempt to discover by such analysis just 
where his difficulty lies. Others of the research tests are 
designed for prognosis, or forecasting the pupil's ability 
in a certain study. 

The teacher-made objective tests — which are too new 
to have been adequately named — are statements about the 
material recently covered in any course, made up so that 
the answers desired can be indicated by underlining a 
word or by some other simple method which will make 
all the answers either right or wrong. One type of such 
a test shows four or more possible ways of completing 
each statement, and the pupil's task is to underline the 
one word which completes the statement correctly. For 

15 



STANDARDIZED TESTS 

example, "Chicago is in (Ohio, Indiana, Wisconsin, IIH- 
nois, Missouri)." Here the pupil's knowledge would be 
shown by underlining the word Illinois. Another type con- 
sists of a large number of statements so selected that about 
half of them are true and about half are false. The pupil 
shows his knowledge of the subject by writing the word 
true or the word false before each statement. These 
"home-made" objective tests are standardized in the sense 
that the scoring of them is done by a standardized plan 
which makes the result the same for all persons scoring 
the paper, but they are not standardized in the sense of 
having standard scores for comparison. 

III. Synthetic scales. The intelligence-achievement 
synthetic scales are combinations of mental-alertness and 
school-attainment tests which allow one to measure at the 
same time the native ability and the school proficiency of a 
group of pupils — to get, at any rate, a general survey of 
the group. The Illinois Examination, one of the best- 
known of these, measures intelligence, silent-reading ability, 
and ability in the four fundamental operations in arithmetic. 
Such composite tests are frequently used as the basis for 
special promotions or for reclassification of pupils. 

IV. Miscellaneous. Tests of will-temperament at- 
tempt to measure the so-called dynamic traits of person- 
ality — the endowment other than intelligence which makes 
for success. They are bringing useful results in certain 
schools training for business careers, where they assist 
in deciding the particular kind of business to which a 
given student may be best adapted. The other tests of 
this group are hardly far enough advanced to be con- 
sidered reliable or really standardized. 

The dependable and important tests at present are the 
tests of intelligence or mental alertness and the tests of 
school achievement. 



16 



CHAPTER III 

TESTS OF ABILITY TO LEARN 

The Meaning of General Intelligence 

Intelligence has been variously defined as the ability 
to learn, the ability to carry on abstract thinking, the 
ability to adapt oneself to new situations, the ability to 
use one's mental powers in a productive way. The latter 
definition is intended to include something more than 
brightness or mental alertness, for, as its author points 
out, one may be bright and alert and yet not manage 
his afifairs well. Such a person may lack mental balance, 
or he may lack the power to resist suggestion, or the power 
to see any but commonplace relations among experiences, 
or the power to see which, among a great mass of facts 
presumed to bear on a question, is the single significant 
fact. Admitting that such are the characteristics of the 
superior individual and that they are of the first import- 
ance in practical and in scientific work, we may well ques- 
tion, however, whether they are the traits with which we 
are most concerned in school. For, much as we may 
regret to say so, the kind of intelligence most required 
for success in school work is brightness or alertness. 
Therefore, so long at any rate as we are primarily con- 
cerned with measuring intelligence for the purpose of pre- 
dicting school success, it is unnecessary to define intelli- 
gence so broadly as to include these traits, admirable as 
they are. 

Defining intelligence as the ability to adapt oneself to 
a new situation again emphasizes a trait which is not of 
outstanding importance in school work. In most schools 

17 



STANDARDIZED TESTS 

the major part of the adapting is done for the pupil by 
the teacher. Seldom is the child asked to go up against a 
really novel situation, for the introduction to the new type 
of problem is nearly always given him as a part of the 
instruction. Although the ability to deal with and adapt 
oneself to a new situation may be ever so desirable, it is 
not precisely the ability most needed for school success. 

Defining intelligence as the ability to carry on abstract 
thinking comes nearer to the type of intelligence now re- 
quired in school. But this somewhat overemphasizes the 
element of thinking. In perfecting skills, such as writing 
or reading, or adding, not a great deal of thinking is re- 
quired. Yet the child who, if he tries, can rapidly master 
these things would usually be called more intelligent than 
the child who cannot. Therefore to limit intelligence 
strictly to the ability to carry on thinking or reasoning 
would seem to make the definition a little too narrow. 

"AbiHty to learn" is probably the most satisfactory 
brief definition of intelligence for the teacher. It is ability 
to learn which must be taken into account in almost any 
kind of school experiment. It is exactly the ability upon 
which the work of the school depends. And it includes 
not only thinking as dealing with novel tasks, but the 
learning of school tasks of all sorts. Whether there is 
such a thing as general intelligence, and whether this ques- 
tion is an important one at present, will be discussed on 
page 23. 

Composition of the Tests 

Intelligence tests were first successfully worked out 
when, after the persons in charge of the schools of Paris 
had established special schools for subnormal children 
without providing a method for selecting such children, 
the psychologist Alfred Binet attempted to perfect devices 
by which subnormal children could with certainty be dis- 
covered. For it appeared to M. Binet a very serious 

18 



TESTS OF ABILITY TO LEARN 

thing to designate a child as feebleminded, and the cause 
of a great injustice if a mistake were to be made. Binet, 
with the assistance of a physician, Simon, worked on a 
new principle. Previously, many "mental tests" had been 
devised in psychological work, but they were usually tests 
of the simpler mental processes. They measured intelli- 
gence, if at all, only by measuring something supposed to 
be correlated with it. Binet began by attempting to meas- 
ure intelligence directly. He devised tests the solution of 
which required thinking and judgment. Furthermore, he 
used tests not singly, as heretofore, but in groups. Binet's 
tests really worked, in the sense that when they were given 
to children whose intelligence was already known by long 
association, they placed the children in correct order. It 
could then be assumed that they would rate children un- 
known to the examiner correctly in comparison with chil- 
dren of a known degree of intelligence. Measuring intelli- 
gence directly and by groups of tests was thus proved 
superior to measuring it indirectly and by single tests. 

Binet's other great contribution to mental measurement 
was to introduce the idea of age levels — the idea that at 
each age the average child is able to solve a certain num- 
ber or a certain type of problems, and that the intelligence 
of any given child can therefore be expressed in terms of 
the "mental age" which he is thus shown to have reached. 
A group of tests could then be standardized as the tests 
which should be passed by a child of a given age if his 
intelligence was average; and his acceleration or retarda- 
tion in mental growth could be expressed in terms of 
years. 

Binet published his first set of tests in 1905 and re- 
vised them in 1908 and again in 1911, and further revi- 
sions were made by Goddard, Terman, and others in 
America. Terman's Stanford Revision of the Binet Scale 
is now the most widely used intelligence scale for careful 
individual measurement, and is commonly regarded as a 

19 



STANDARDIZED TESTS 

remarkably accurate psychological instrument. It consists 
of ninety tests, with six or eight at each age level from 
three years to sixteen years. For the youngest children, 
the tests are of such simple things as pointing to the nose, 
eyes, etc.; naming familiar objects, such as a knife, key, 
penny; telling whether a boy or girl, and so on. Older 
children are asked to define such abstract words as pity, 
revenge, charity; to fill out words in dissected sentences; 
to discover the meaning of fables, and so on. The stand- 
ards for this scale are the results of very careful work 
with thousands of children. Its use should, however, be 
left to the expert, since accurate and reliable results can 
be secured only after the examiner has undergone con- 
siderable training. So much for individual tests. 

The group tests of intelligence are one of the out- 
comes of the war. Since it was desirable to test the in- 
telligence of all recruits entering the army, and since this 
was obviously impossible if the men were taken one at 
a time, as the Binet method required, it was necessary to 
devise a test which could be administered to large numbers 
simultaneously. Such a test was perfected by American 
psychologists during 1917 and 1918, was given to one 
million seven hundred thousand soldiers, and after the 
war was given to a very large number of high school and 
college students. Adaptations to younger children were 
then worked out, so that today we have group tests for 
every age from six to sixteen — the age at which mental 
maturity is presumed to have been reached. 

The group intelligence tests contain such problems as 
the following: The pupil may be given a paper covered 
with letters and geometrical figures and be asked to draw 
certain lines from one to the other, or to underline certain 
of them. This is a test of ability to carry out directions. 
The directions are given rather rapidly and must be ex- 
ecuted in a limited amount of time. A second test may 
consist simply of arithmetical problems, ranging in diffi- 

20 



TESTS OF ABILITY TO LEARN 

culty from such simple exercises as "How many are 30 
and 7 men?" to **If a man runs a hundred yards in ten 
seconds, how many feet does he run in a fifth of a 
second?" A third test may be a test of common sense, 
such as placing a cross before the best reason of the fol- 
lowing three: "Why do we use stoves? Because ( ) 
they look well, ( ) they keep us warm, ( ) they are 
black." Another test may ask the student to indicate 
whether a certain pair of words have the same or opposite 
meanings ; or to show the relationship between words by 
some such scheme as underlining the word in italics which 
bears the same relation to the third word that the second 
does to the first, in the following series: "Gun — shoots; 
knife — 7'uns, cuts, hat, bird." Or the test may ask the 
person examined to add the next two numbers to a series 
such as: "3, 6, 9, 12, 15, 18, — , — ." It may ask him 
to point out which four of a series of six pictures are 
alike in some way; e. g., they all refer to summer, or all 
to a certain kind of act. It may simply ask for items of 
general information, such as would be shown by underlin- 
ing one of the words in italics in this sentence: "The 
tuna is a kind of iish, bird, reptile, insect." Each of these 
tests usually begins with very easy questions and goes on 
gradually to very difficult ones. The time limit prevents 
anyone but a genius from finishing the test, and the scores 
are computed in terms of the number of exercises that 
have been completed correctly in the given time. 

The underlying principle of such testing is that we 
can get a fairly good measurement of a person's general 
intelligence by taking, as it were, samplings here and there 
of different kinds of ability. Whether this method actually 
measures intelligence can be determined only by selecting 
a certain number of persons to test and checking up the 
results against some slower and presumably more accurate 
measurement of intelligence, such as that secured through 
long acquaintance with the selected individuals. To what 

21 



STANDARDIZED TESTS 

extent the tests are thus vindicated is discussed below 
under 'Vahdity of the tests." It seems obvious that if 
a thirty-minute intelligence test puts a group of men in 
the same order as that in which they would be placed 
by persons well acquainted with them, then the officer 
can know at once the kind of men he has to drill and 
the teacher can know on the first day of a semester — not 
after several weeks — the amount of ability of each child 
whom she is to instruct that semester. 

Another principle on which such tests are built is that 
if intelligence means ability to learn, it can be measured 
not only by having the tested person learn something new 
during the test, but also by measuring the amount he has 
learned in the past. It is on this basis that the intelligence 
scales justify the use of tests of general information, and, 
to a certain extent, the tests involving arithmetical prob- 
lems. The items in such tests must be selected from 
sources of information or from types of training which 
are common to all the persons tested. They must not, of 
course, be drawn from special fields of learning or from 
special kinds of environment. In the Army tests they 
were apparently taken from items learned in the first four 
or five grades of the schools and from the reading matter 
and advertisements of newspapers. Similar considerations 
have controlled the making of intelligence tests for use in 
schools. 

Validity of Intelligence Tests 

If intelligence is taken as ability to learn, then the 
fact that the existing tests usually presuppose a certain 
educational background and a certain ability to deal with 
abstractions is no drawback as far as ordinary school uses 
of the tests are concerned. The tests may indeed be to a 
considerable extent linguistic, but linguistic ability is prob- 
ably the most important single factor in success in the 
present-day schools. We are ordinarily interested in learn- 

22 



TESTS. OF ABILITY TO LEARN 

ing whether given pupils have or have not the ability to 
master the present school tasks. Experimental evidence 
shows that the intelligence tests now available are able 
to measure this ability very well indeed. 

The fact that many men of proven ability failed in 
their school work — Ohver Goldsmith, Lord Byron, Charles 
Da,rwin, etc. — might imply either that the ability of such 
men is of a highly specialized type, or that these men 
found nothing in the schools of their day which appealed 
to them as worth while. The latter theory will explain 
such cases, but will not so readily explain the cases of 
boys, known to all of us, who do try to master their 
school tasks, without much result, and who later become 
successful mechanics and business executives. These per- 
sons seem to be endowed with special ability to deal with 
machines or with people, even though they cannot deal 
with anything abstract. Our ordinary intelligence tests do 
not measure such abilities. A special test has recently 
been designed for measuring mechanical aptitude,^ but the 
test for measuring social aptitude is still in the future. 
The existence of what appear to be these three types of 
intelligence — the abstract, the mechanical, and the social — 
may necessitate the reorganization of the schools to provide 
more adequately for the latter two, but it does not invalidate 
the work of the present intelligence tests in measuring the 
ability to master the present school curriculum. 

Another point a little difficult to keep clearly in mind 
is this : Saying that intelligence tests measure a pupil's 
ability to master school work is not saying that they 
measure the probability that he will master it. Purpose 
and effort count heavily, of course, and general intelli- 
gence tests should not be criticized for their failure to 
measure factors such as these, which they were not 
designed to measure. Intelligence tests will not tell us 
which pupils will succeed, but only which pupils will suc- 

* Stenguist Mechanical Aptitude Test. 

23 



STANDARDIZED TESTS 

ceed if they try. The latter is obviously a very valuable 
kind of information, since by means of it teachers and 
parents can with confidence put on pressure to make sure 
that indolent pupils do try, and can likewise lessen the 
pressure on pupils shown by the tests to be doing all they 
are able to do. 

That intelligence tests really do measure intelligence 
as well in one hour as it can be estimated after several 
months' acquaintance is proved by methods such as giving 
the tests to pupils whom the teacher knows well, and 
comparing the order in which the pupils are placed by the 
tests with the order in which they are placed by the 
teacher. For most of the children in the class the agree- 
ment will be remarkably close. If one then goes a step 
farther and studies very carefully the pupils about whom 
the teacher and the test do not agree, he can discover 
which of the two is the more dependable. Such experi- 
ments have been carried on in a number of cities and 
sometimes with large numbers of pupils. For example, 
in the junior high school of the University of Oregon^ 
one hundred and twenty-five pupils were given three dif- 
ferent intelligence tests and at the same time were ranked 
as to their intelligence by six of their teachers inde- 
pendently. The teachers' estimate of intelligence was 
made out in this case with a great deal of care, A study 
of the cases regarding which the teachers and the tests 
did not agree then showed that the teachers tended to 
overestimate over-age pupils and pupils who were talkative 
and vivacious, and to underestimate the younger and physi- 
cally-undeveloped pupils or pupils who were shy and retir- 
ing, or else that they made no distinction between a pupil's 
intelligence and his proficiency in his studies. Studies 
made earlier by Binet and by Terman had had the same 
outcome.^ Such experiments would seem to show that 

* Ruch, Study of Mental, Pedagogical, and Physical Development, Univer- 
sity of Oregon Publications, No. 7. 

3 Terman : Measurement of Intelligence, Chapter 2. 

24 



TESTS OF ABILITY TO LEARN 

the scores on a brief intelligence test are often even more 
dependable than the pooled opinion of several teachers 
well acquainted with the children. 

Intelligence tests can also be checked, though less satis- 
factorily, by comparing their results with school marks. 
When this was done, for example, in the High School 
of Leavenworth, Kansas,* the agreement was found to be 
especially close in such abstract subjects as Latin and 
algebra. Forty-five cases in which test scores and school 
marks did not agree were then investigated individually, 
and in all except three cases the low marks of these pupils 
were found to result from poor health, indolence, irregular 
attendance, or some other factor besides low intelligence, 
and the high marks from such influences as excellence of 
attitude in class or exceptional effort. In other words, the 
tests were shown as before to be the more accurate meas- 
urements. Similar experiments brought similar results 
among the students of Brown University^ and Smith 
College. 

Intelligence tests were checked in the Army by com- 
paring their rankings of the men of a company with the 
rankings given by an officer who had known the men for 
several months. Thus, in one group of over seven hun- 
dred men, whose officers were asked to rank them as to 
"practical soldier value," there was substantial agreement 
between officers' rating and test-rating in 88% of the 
cases. Considering the number of factors which influence 
practical soldier-vahie besides intelligence, this seems re- 
markably close. About the same amount of agreement was 
discovered independently in several other camps, in experi- 
ments each of which involved several hundred men. 

Intelligence tests can also be checked by seeing whether 
the highest scores are made by persons in occupations 

•* Bright : Intelligence Examination of High School Freshmen. Journal 
of Educational Research, June, 1921. 

5 Colvin in Educational Review, June, 1920 ; School and Society, July 5, 
1919, and July 29, 1922. 

25 



STANDARDIZED TESTS 

which are commonly supposed to require the highest intel- 
ligence. When army-test scores were analyzed according 
to the occupations of the men taking the test, the order 
of occupations by size of score was this : Professions, 
clerical occupations, trades, partially skilled labor, unskilled 
labor. This is the order of intelligence in which any 
schoolman would put the occupations if intelligence is 
taken in the sense of the school ability of the boys who 
later enter these occupations. Similar experiments at 
Leland Stanford University, using from thirty to two 
hundred representatives of each occupation, placed the 
groups in the following order: College students (future 
followers of the professions), business men, express em- 
ployes, motormen and conductors, firemen and policemen, 
salesgirls. More work of this type, to supplement the army 
results, is very much needed. 

The best way, however, to check the validity of intelli- 
gence tests for school use is to try them out in the schools. 
Their commonest use is for purposes of classification of 
pupils into groups of uniform ability. The plan is in 
effect, for example, in the Harrison Technical High 
School, Chicago; the University of Minnesota High 
School, and in the high schools of Montclair, New Jersey; 
Long Beach, California, and Oakland, CaHfornia. From 
each of these schools the experiment is reported a success. 
Some of the schools have stated that they could not be 
induced to return to the old haphazard method of classify- 
ing pupils. As to the reliability of scores, the group tests 
are thus proved to give results which at the least are 
sufficiently accurate to efifect a very great advance over 
existing methods of classification. 

The score on a single group test of intelligence should 
not be depended upon, however, where a decision is to 
be made about a single individual which will greatly in- 
fluence his future. To decide, for example, whether a 
certain pupil is to be considered feeble-minded and sent 

26 



TESTS OF ABILITY TO LEARN 

to a special room or a special school, an individual test 
should be used. And for making less momentous but 
still important decisions about a single individual, the re- 
sults of two or more group tests should be combined. 
The group-test score for each individual should always be 
considered an approximation. In measuring the intelli- 
gence of a collection of individuals, such as those of a 
school or a room as a whole, the group tests are much 
more thoroughly dependable, for with a large number of 
measurements the small errors in each measurement tend 
to balance and cancel each other. The average or median 
scores resulting from the use of a group intelligence scale 
may therefore be accepted as accurate, while the score 
of each individual should be considered a rough measure 
subject to some correction from other intelHgence tests 
which may be given later. 

"Do not speed tests, or tests with a time limit, penalize 
unjustly the person who thinks slowly and accurately?" 
one is frequently asked. "If more time were given, would 
not the slow thinker often prove the most intelligent of 
all?" The evidence we have so far secured seems to prove 
the opposite. When the work of persons who have 
answered only a few questions of the test is compared 
with the work of persons who have covered much more 
ground, the former are discovered to have made more 
mistakes than the latter. This is true in spite of the fact 
that the slow worker may have finished so little of the 
work that he had perhaps only half as many chances to 
make a mistake. As to extending the time, this was tried 
out pretty thoroughly in the Army. In one camp 123 
men, in another 387, and in another 510 men were given 
the intelligence test, first with the usual time limits and 
then with the time doubled. The ranks given the men 
by the two methods of testing them were almost exactly 

27 



STANDARDIZED TESTS 

the same.^ In fact, this was one of the closest confirma- 
tions ever found in work with intelligence tests. 

The Principal Uses of Intelligence Tests 

The bright children of each grade may be placed to- 
gether in one room, as explained before, the mediocre 
children in another, and the dull children in another. The 
bright children may then either be helped to cover the 
required work more rapidly and to finish their course more 
quickly, or they may be put through the course in the 
same number of years as others, but be given an enriched 
curriculum each year. Similarly, the dull pupils may be 
allowed either to cover the regular course in a longer time 
or to finish in the regular time with minimum essentials 
only. 

The latter plan is in effect, as one instance, in the 
elementary schools of Detroit. There the children are 
divided upon first entering school into a bright section com- 
posed of the upper 20%, a medium section of the middle 
60%, and a dull section of the lower 20%, who are given 
respectively the enriched course of study, the regular 
course, and the simplified course. Although the test classi- 
fication is considered tentative, and shifting pupils from 
one group to another is permitted, very little shifting has 
been necessary.'^ 

Uniform classification permits adaptation of the teach- 
ing method to the ability of the pupils. For example, 
more time can be given to drill and to review in the class 
of dull pupils, and this cuts down the number of failures. 
It gives some of the dull pupils their first chance to de- 
velop as leaders in the classroom. It eliminates, according 
to many teachers who have tried it, a large number of 
disciplinary difficulties — for the dull pupil can understand 

8 Psychological Examining in the United States, Part II, Chapter 9. Offi- 
cial Report to the Surgeon General. 

^ For further details of the working of the plan, see the Twenty-first Year- 
book of the National Society for the Study of Education. 

28 



TESTS OF ABILITY TO LEARN 

what is going on, and the bright pupil is not obliged to 
kill time. It gives some of the bright pupils their first 
taste of real competition. It thus prevents the formation 
of injurious and tenacious habits of indolence, and makes 
it possible to develop to their full capacity those on whom 
the country must depend for its thinkers in every line of 
endeavor. 

Intelligence tests can be used, even where classification 
is impossible, as a basis of school marks. Many schools 
are beginning to give marks and promote pupils, not on the 
basis of what the pupil accomplishes as compared with 
other pupils, but on the basis of what he accomplishes as 
compared with his ability. Ability is measured, of course, 
by intelligence tests, and achievement by achievement or 
subject-matter tests. This plan is found to bring real 
effort from almost everyone and to overstrain no one. 

Intelligence tests may also be used, in connection with 
achievement tests, in deciding upon the efficiency of a given 
piece of teaching. The intelligence tests tell what the 
pupils are capable of doing, the achievement tests tell what 
they have actually done. A good deal of light can be 
thrown in this way upon the value of different methods 
or devices in teaching. 

In high schools and colleges, intelligence tests may 
be used in determining which students are to be permitted 
to carry extra courses. The older method, which makes 
this decision depend upon average scholarship in the pre- 
vious semester, often induces the more ambitious but less 
hardy students — especially adolescent girls — to overwork, 
and bring permanent injury to their health. Intelligence 
measurements permit extra work to be carried only by 
those who are bright enough to do it without injustice to 
themselves. Besides determining the amount of work, the 
tests may also determine the kind of work a student should 
attempt. That students below a certain standing on the 
intelligence scale will almost certainly fail in the more 

29 



STANDARDIZED TESTS 

abstract subjects, such as algebra and Latin, we know 
because it has been observed that practically all students 
below this degree of intelligence do fail. Unless we be- 
lieve very thoroughly in the superiority of the abstract 
type of intelligence, this is not necessarily a reflection on 
the student, and in advising him away from those studies 
we need not present it as such. 

Both in the junior and in the senior high schools, the 
intelligence tests are useful as a basis for advice in the 
choice of a vocation. Although we cannot tell a boy the 
occupation in which he is pretty sure to succeed — for this 
depends largely upon temperamental and emotional traits — 
we can tell him many of the occupations in which he is 
practically certain to fail; for each occupation requires a 
certain minimum degree of intelligence. The professions, 
for example, cannot be entered without graduation from 
high school and from college, and the degree of intelli- 
gence for graduation from college, as well as the degree 
of intelligence for graduation from high school, is pretty 
well known in terms of scores on intelligence tests 
taken very much earlier. We know in a rough way the de- 
gree of intelligence characteristic of each of the occupa- 
tions reported by the two million men examined in the 
United States Army. The occupations whose mem- 
bers made the lowest average scores were those of : 
laborer, general miner, teamster, barber; those in the next 
group were: horseshoer, bricklayer, cook, baker, painter, 
general blacksmith, general carpenter, butcher, general ma- 
chinist, hand riveter, telephone and telegraph linesman, 
general pipefitter, plumber, tool and gauge maker, gvm- 
smith, general mechanic, general auto repairman, auto 
engine mechanic, auto assembler, ship carpenter, telephone 
operator; those in the next higher group were: concrete 
construction foreman, stock-keeper, photographer, teleg- 
rapher, railroad clerk, filing clerk, general clerk, army 
nurse, bookkeeper. The next higher group included: 

30 



TESTS OF ABILITY TO LEARN 

dental officer, mechanical draftsman, accountant, civil en- 
gineer, medical officer. The highest group included army 
chaplains and engineer officers. It is true that vocations 
are now very commonly taken up for reasons other than 
one's special fitness for them, and thus that many of the 
men in these occupations do not really belong in them. 
But a considerable number of bad choices are remedied by 
shifting about from one occupation to another, and the 
above Ust may be taken as furnishing at least a rough 
indication of the amount of (abstract) intelligence charac- 
teristic of the large groups of occupations. On this basis 
some advice can be given, particularly as to the vocations 
that ought not be attempted. It is important that school 
people give up their present inclination to advise all pupils 
to be ambitious enough to try to climb into the more 
intellectual occupations. This advice is injurious both to 
the boy who may be thus led into a failure in life and 
to the clients or patients whom he may but half com- 
petently serve. If occupations are considered respectable 
and admirable in the degree to which they meet real and 
wholesome needs among the people they serve, and not, 
as at present, in the degree to which they involve the use 
of the brain rather than the hand, then the whole basis 
of vocational advice will be changed. Intelligence tests 
may be made to assist very greatly in placing the pupil 
where he will render his best service. 

In colleges and universities intelligence tests are coming 
to be used as at least a partial basis for determining 
admission. They are so appHed, for example, at Columbia 
University, the Carnegie Institute of Technology, and the 
University of Michigan. They are also used in advising 
a student in choice of studies, as at the School of Com- 
merce and Administration of the University of Chicago. 
Sometimes they are used in deciding whether a student 
is to be asked to withdraw from the university, for it is 
obvious that if his low grades in first-semester work are 

31 



STANDARDIZED TESTS 

coupled with very low intelligence scores it is useless to 
ask him to stay longer, while if they appear in connection 
with very good intelligence scores, the cause of the low 
grades may be something remediable, such as inadequate 
preparation or poor study habits. In certain colleges in- 
telligence scores are used as a basis for grades, and the 
brighter students are compelled to keep their attainment 
up to their ability. In at least one college, intelligence 
tests are likely to serve as the ground for the award of 
a very large number of scholarships. The general effect 
of the measurement of intelligence in colleges seems to 
be the improvement of the morale of the student body, for 
the students come to feel that the faculty is not so much 
engaged in exacting work from them for a diploma as 
in guiding and helping them to make the most of their 
abilities. 

Outside of schools, tests of intelligence are coming to 
be used in connection with such industrial and social 
questions as immigration, vocational placement, and the 
treatment of criminals. In deciding whether an immigrant 
is to be admitted to this country, it seems more sensible 
to test his intelligence than to test his degree of literacy. 
For illiteracy may be an accidental handicap and certainly 
is a remediable one, while low intelligence is not only a 
permanent characteristic of the individual, but will prob- 
ably be a characteristic of a large number of his offspring 
for all time. Those of low-grade intelligence, not the 
illiterates, are really the least desirable citizens. 

For vocational placement the tests of intelligence 
already are widely used, and by the employment depart- 
ments of many large companies they are considered stand- 
ard equipment. They are used both to determine whether 
an applicant shall be accepted, and, if accepted, to deter- 
mine for which kind of work he shall be trained. To 
place a man in a job in which he will be contented, and 
thus to reduce the labor turnover, is rapidly coming to 

32 



^ 



TESTS OF ABILITY TO LEARN 

be thought of as good economy. Experiments in industrial 
estabHshments have shown that there is a very close rela- 
tionship between one's liking for his work and his intel- 
lectual adjustment to it. Neither able men in routine work 
nor dull men in highly organized work report themselves 
as liking their jobs. On the other hand, both able men 
in difficult work and stupid men in simple mechanical work 
report that their work is to their liking. Work adapted 
to the ability of the individual, other things equal, means 
a longer stay at the job and a lessened cost in breaking 
in new men. 

From the point of view of the employee, as well as 
from that of the employer, intelligence tests for job place- 
ment are of great value. There is no more real contribu- 
tion to one's happiness than to spend the eight or ten 
hours of one's working day in activity which one enjoys. 
The working hours make up the largest continuous section 
of one's waking life. School people must come to see 
that to assist a pupil to choose a congenial life work is 
the greatest single benefit they can confer on him. Intel- 
ligence scores, although by no means the only criterion in 
selecting a vocation, are yet one of the most important of 
the criteria. 

Criminology makes increasing use of intelligence tests 
in deciding responsibility for infractions of the law. It is 
to be hoped that it will soon be possible to do away 
with the present common method of dealing with cases 
of doubtful responsibility, that is, the giving of a com- 
promise sentence. If the criminal is feeble-minded he 
should indeed be placed where he is not likely to commit 
another crime, but he should not be punished. If he is 
of normal intelligence and capable of understanding the 
effects of his acts, then he should be punished so severely 
as to deter others from crime. The compromise sentence 
defeats both these objectives. 

Such are some of the uses of the intelligence tests. 

33 



STANDARDIZED TESTS 

Their abuses, unfortunately, would furnish an almost 
equally long list. The testing movement has grown so 
rapidly that its greatest danger now is in its over- 
enthusiastic friends. An intelligence test is not a panacea 
for all school ills; neither is it fool-proof. It is necessary 
that users of intelligence tests know something of what 
is being measured, and something of the interpretations 
that may or may not be put upon the resulting scores, 
or these scores are sure to be misapplied. It is well to 
remember how rough a single measurement of an in- 
dividual is. And it is well, in making applications of the 
results, to remember that the kind of intelligence test in 
most common use probably is not a general intelligence 
test at all, but a test of one particular type of intelligence, 
the academic type; and that, therefore, however well the 
tests acquit themselves in applications which lie within 
the school, they must at any rate be used very much more 
cautiously in applications which reach outside the school. 
The future influence of perfected scales for measuring 
intelligence is something we cannot yet imagine. It is 
sometimes said that our manner of life was completely 
revolutionized in the nineteenth century by the utilization 
of power machinery, and that it will again be revolution- 
ized in the twentieth century by the utilization of air 
transportation. But it is possible that intelligence tests 
may bring the race more real progress than either of these, 
because they will serve to place in strategic positions the 
men most capable of using the opportunities for progress 
which lie there. They will put able men where ability will 
count. Advancement can thus be tremendously accelerated. 

Selection of an Intelligence Test 

There are a few very simple criteria to keep in mind 
in choosing an intelligence test for a given experiment, 
and others not so simple. Of course, one should consider 
the pupil-age for which the test was designed, in order 

34 



TESTS OF ABILITY TO LEARN 

that he may not perhaps secure for upper grades a test 
designed for primary children or for college students. He 
should also consider the time taken to administer the test, 
with reference to the length of his class period, or to the 
possibility of extending the period. He should consider 
the length of time needed for scoring the papers. Many 
tests cut this down by special devices, such as transparent 
answer sheets to be laid over the pupil's answer list. He 
should, of course, consider money cost, though the amount 
of difference between the various tests in this matter is 
not ordinarily great enough to make this a very important 
item. 

Less simple considerations are the extent to which the 
given test has been used and the degree to which its results 
check up with other tests or with a repetition of the same 
test. The extent to which a test has been used shows 
something of the number of cases upon which its mental- 
age norms are based, and shows in a very rough way 
how effective it has proved under actual schoolroom con- 
ditions. But this is, of course, an unfair criterion for 
the newer tests, and the newer ones are often best because 
their authors have had an opportunity to profit from 
pioneer mistakes. Furthermore, extent of use may depend 
more directly upon vigor of advertising by the retailing 
company than upon merit. The degree to which one in- 
telligence test agrees with others, when four or five are 
given to the same children, and the degree to which a 
test agrees with a repetition of itself, are excellent checks. 
These can be found out by reading the author's announce- 
ment of the construction of his scale and the tests to 
which he has himself put it, and by reading the reports 
of persons who have subsequently procured and used the 
scale. These accounts will be found in the educational 
and psychological journals, particularly in the Journal of 
Educational Psychology, the Journal of Applied Psychol- 

35 



STANDARDIZED TESTS 

ogy, the Journal of Educational Research, the Elementary 
School Journal, the School Review and School and Society. 

Another important consideration is the purpose for 
which the test is being given. If the purpose is the pre- 
diction of scholarship, or more strictly of ability to master 
the present curriculum in case effort is made, then the 
relationship already established between school marks and 
scores on the given test is important. That is, if we wish 
to lind whether this test will measure the pupil's ability 
to learn what the schools now teach, it is important to 
see whether the ability it does measure has already been 
found by an examination of grades to be closely related 
to scholarship. Checks of this kind are desirable, for 
example, where the tests are to be used for college or 
high school entrance, for special promotions in the ele- 
mentary school, or for classification of pupils into groups 
of uniform ability. On the other hand, if the purpose of 
the experiment is vocational advice, then the relationship 
between test scores and scholarship is of less importance. 
If the purpose is to test for commitment to a special 
school or institution, the individual rather than the group 
tests should be selected. 

A list of all the standardized tests in print, both 
intelligence and achievement tests, is now available under 
the title, "Bibliography of Tests for Use in Schools," 
which can be procured for ten cents from the World 
Book Company. This booklet gives the name of the pub- 
lisher of each test, and in some cases the reference to 
the journal in which the test was announced and described 
by its author. The forty-six intelligence tests published 
before 1922 are listed in the Twenty-first Yearbook of the 
National Society for the Study of Education (published 
by the Public School Publishing Company). This list 
gives the title and author of each test or scale, the number 
and nature of the different tests included, the age of 
pupil to which the scale is adapted, the number of minutes 

36 



TESTS OF ABILITY TO LEARN 

needed for the test, the pubHsher and price, and the jour- 
nal in which the test is described. Information about the 
available tests can be secured by writing for the price 
lists and descriptive literature issued by the publishing 
houses especially interested in this field, some of which 
are: 

The Plymouth Press, Chicago, 111. 

The World Book Company, Chicago, 111., or Yonkers- 
on-Hudson, N. Y. 

The Public School Publishing Company, Bloomington, 
111. 

The C. H. Stoelting Company, Chicago, 111. 

The Bureau of Publication, Teachers' College, Colum- 
bia University, New York. 

Some of the best-known intelligence tests are the fol- 
lowing : 

For the kindergarten and the primary grades: The 
Detroit Kindergarten Test; Detroit First-grade Intelligence 
Test ; Haggerty Intelligence Examination, Delta I ; HoUey 
Picture Completion Test; Kingsbury Primary Group In- 
telligence Scale ; Otis Group Intelligence Scale, Primary 
Examination ; Pressey Mental Survey Tests, Primer Scale. 

For the intermediate and upper grades : Haggerty 
Intelligence Examination, Delta II ; Illinois General Intel- 
ligence Scale ; National Intelligence Tests ; Otis Group 
Intelligence Scale, Advanced Examination; Pintner Mental 
Survey Tests ; Pintner Non-Language Mental Tests ; Pres- 
sey Cross-Out Tests ; Whipple Group Test for Grammar 
Grades. 

For the high school : Army Alpha Intelligence Tests ; 
Miller Mental Ability Tests; Otis Group Intelligence 
Scale, Advanced Examination; Terman Group Tests of 
Mental Ability. 

For colleges and universities: Brown University Psy- 
chological Examination ; Roback Mentality Tests for Supe- 
rior Adults ; Rogers Group Tests of Intelligence ; Thorn- 

37 



STANDARDIZED TESTS 

dike Intelligence Examination for High School Graduates ; 
Thurstone Psychological Examination for College Fresh- 
men and High School Seniors. 

For all grades : The Bifiet-Simon Scale, and its revi- 
sions by Goddard, Yerkes, Kuhlman, and Terman; the 
Myers Mental Measure; the Trabue Mentimeter. 

Giving the Tests 

The person giving the test should remember that reli- 
able results cannot be secured unless the printed directions 
are followed exactly as they are stated. Directions for 
giving the test are sent with the test blanks. A small 
change in the manner of giving the test may affect the 
scores so as to render them useless. H several rooms are 
to be tested, and the results compared, it is better to have 
all the tests given by one person, so as to have the con- 
ditions more nearly uniform. 

The papers are next scored by means of the printed 
answer lists. The score may be written on the cover of 
each pupil's paper. The "point score," or sum of the 
points made on the various tests, should then be con- 
verted into mental ages. A pupil's mental age is the 
same as the chronological age of the average pupil who 
makes his score. An example will make this clear. If a 
child should make seventy points on a certain intelligence 
scale, and it should be found that seventy points is the 
score made by the average nine-year-old child, then his 
mental age is nine years. This means that he is as mature 
mentally as the average child of nine years. The mental 
age corresponding to each point score will be found written 
opposite the point score in tables furnished by the maker 
of the test. 

The teacher may next wish to find the intelligence 
quotients. As the mental age is the measure of mental 
maturity or ability, the intelligence quotient is the measure 
of brightness, or ability in relation to age. The intelli- 

38 



TESTS OF ABILITY TO LEARN 

gence quotient is found by dividing the mental age by 
the chronological age. For example, if a boy has a mental 
age of 12 and a life age of 10, his intelligence quotient 
(or I. Q.) is 1.20, or, if we drop the decimal point, 120. 
The I. Q. of the average child is, then, 100; that of duller 
children is below 100, and that of brighter children is 
above 100. The exact meaning of any given I. Q. — the 
comparative amount of brightness or dullness implied — is 
shown in tabular form for each test. The intelligence 
quotient can be seen to have a meaning quite different from 
that of the mental age if we compare a child whose mental 
age is 12 and chronological age 10 with another whose men- 
tal age is 6 and chronological age 5. Both will have I. Q.'s 
of 120 and both will therefore be equally bright, but the 
older child, because of his maturity, will be able to solve 
many kinds of problems and work his way out of many 
difficulties, in the face of which the younger child would 
be helpless. Their brightness is the same, but their mental 
ability is very different. 

Many of the current uses of mental tests have been 
mentioned above, and some of the commoner abuses 
pointed out. Some definite use of the results ought to 
be in mind before the test is selected and purchased. 
Thousands of tests are given each year which lead to no 
change whatever in any part of the work of the school. 
Such a waste of the time of teachers and pupils can be 
avoided by careful study of the meaning and implications 
of the measurement of intelligence. 

Caution has already been given about overconfidence 
in results when one is dealing with single pupils. The 
lowest scores, in particular, should be considered tentative. 
They may be the result of fright or temporary indisposi- 
tion as well as the result of stupidity. It is well, there- 
fore, to check these cases by giving a second form of the 
test to such pupils. For this purpose a small number of 
the second form may be ordered when the other test 

39 



STANDARDIZED TESTS 

blanks are sent for. About ten per cent should be sub- 
tracted from the second score in order to allow for the 
effects of the previous acquaintance with this type of test. 
A duplicate form of a test, it may be explained, is another 
edition of the test, made up of exercises similar to those 
in the first form, but not identical with them. 

The teacher who wishes to become thoroughly ac- 
quainted with the literature on intelligence testing should 
subscribe for one or more of the journals mentioned on 
pages 35 and 36, and should read some of the following 
books : 

The Twenty-first Yearbook of the National Society for 
the Study of Education. (The Public School Publishing 
Co.) Part I outlines the nature, history, and principles of 
intelligence testing, and Part II describes administrative 
uses of intelligence tests in various cities and for various 
ages of pupils. 

Terman : The Measurement of Intelligence, and The 
Intelligence of School Children. (Houghton Mifflin Co.) 
Scholarly and readable accounts of the revision and applica- 
tion of the Binet scale. 

Ballard: Mental Tests. (Hodder and Stoughton, Lon- 
don.) A very interesting outline of the subject by an Eng- 
lishman. Includes description of new tests for discovering 
super-normal children. 

Book: The Intelligence of High School Seniors. (The 
Macmillan Company,) Description of the measurement of 
the intelligence of six thousand seniors in Indiana, and the 
relation of the scores to scholarship, vocational preference, 
college choice, economic status, sex differences, etc. 

Trabue and Stockbridge: Measure Your Mind. 
(Doubleday Page & Co.) A well-written general account 
of the significance of psychological measurements, and an 
outline of the "mentimeter" tests. 

Yoakum and Yerkes: Army Mental Tests. (Henry 

40 



m 



TESTS OF ABILITY TO LEARN 

Holt & Co.) A very interesting portrayal of the psycho- 
logical work in the United States army. 

Terman et al: Intelligence Tests and School Reorgan- 
isation. (World Book Co.) A brief general discussion 
followed by a description of the uses now made of intelli- 
gence tests in certain selected cities, large and small. 

Goddard: Human Efficiency and Levels of Intelli- 
gence. (Princeton University Press.) Informal lectures, 
printed within about a hundred pages, discussing mental 
measurement in its social implications. 



41 



CHAPTER IV 

TESTS OF AMOUNT LEARNED 

General Statement 

The development of achievement tests runs closely 
parallel to the development of intelligence tests. The 
achievement or educational tests reached a usable stage 
somewhat later than the individual tests of intelligence, but 
earlier than the group tests. 

It seems that the first definite scale for measuring the 
excellence of school work was made by a certain Reverend 
George Fisher, and was described as being in regular use 
in the Greenwich Hospital School in England, in 1864. 
The Reverend Fisher constructed what he called a scale- 
book — a set of samples of pupil work in handwriting, 
grammar, spelling, drawing, etc. — which was kept on file 
to show the degree of excellence expected in each division 
of the school. No one saw the significance of the device 
at the time, and it was not copied elsewhere. In America 
the beginnings of standardized measurements of school 
products were made by Dr. J. M. Rice in 1895, when 
he put together a list of words which he applied in various 
schools as a spelHng test. His surprising discovery was 
that children in schools which devoted very little time to 
spelling were able to spell as well as children in schools 
which spent many hours in drill on spelling. These results 
he published in the Forum, then a widely-read magazine, 
where as a news feature story they attracted for a short 
time considerable notice. He also presented his results at 
the annual meeting of the National Education Association, 
but there found the coldest of receptions. For the peda- 

42 



TESTS OF AMOUNT LEARNED 

gogues were agreed that his scores showed nothing, since 
he was under a misapprehension as to the purpose of 
education. Dr. Rice, not being a schoolman, had supposed 
that the purpose of teaching speUing was to give the chil- 
dren the ability to spell. The schoolmen informed him 
that the purpose of teaching spelling was to train the 
pupils' minds ! 

After this first attempt at the scientific measurement 
of classroom results had been wrecked in the 90's against 
the dominating concept of formal discipline, nothing fur- 
ther of the kind was reported until 1908. In that year 
Stone published his scale of reasoning problems in arith- 
metic. The following year Thorndike published his hand- 
writing scale, a graded series of samples of penmanship, 
by means of which a pupil's ability was to be determined 
by comparison of his writing with these samples. This 
third recurrence of standardized measurements in education 
was destined to live. In 1914 the movement had come 
sufficiently into favor to be endorsed by the National Edu- 
cation Association — no longer so completely under the in- 
fluence of the theory of formal discipline — and since that 
time its growth has been extraordinary. At present it is 
receiving more attention and, better, more careful painstak- 
ing endeavor, than any other movement in education. Two 
leading journals are devoted entirely to quantitative educa- 
tional studies, and three or four others are giving more 
than half their space to such studies. More and more of 
our graduate students are specializing in this field. As 
these highly-trained men and women procure for us more 
and more facts about pupils and teachers and schools, we 
may expect educational discussions to move more and more 
away from their past boresome character of moralistic in- 
junction and unguarded speculation resting on the flimsiest 
of factual bases — the kind of speculation which leaves one 
with the impression that an equally able man could make 
out an equally good case for the other side of the ques- 

43 



STANDARDIZED TESTS 

tion — and to acquire gradually a set of doctrines which 
will stand permanently and which can be confidently ac- 
cepted and used, because they are based on information 
that is precise, objective, impartial, and verifiable, or in 
other words, scientific. 

It is sometimes believed that the use of standardized 
tests will tend to overemphasize the mechanical aspects of 
school work. This belief is natural, since the tests which 
were first developed, and consequently are now most 
widely used, are tests of the simple and definite results 
of ''drill" work. As these are the outcomes easiest to 
measure, and so most commonly measured, teachers often 
may receive the impression that if they wish to appear 
efficient they must emphasize these mechanical phases of 
their work. The facts, however, are these: First, the use 
of standardized tests for teacher rating should be post- 
poned, for the most part, until we have well-perfected 
tests in a greater variety of subjects. The tests should be 
used at present for pupil diagnosis, pupil motivation, pupil 
rating in particular subjects, for the trial of teaching 
methods, and so on, but not, except in a supplementary 
way, for teacher rating. Second, announced standards in 
the tool subjects should be thought of not only as goals 
to be reached, but also as limits not to be exceeded. In 
cities in which standardized tests are extensively used, bul- 
letins are often issued informing the teachers that they 
are up to standard in the mechanical aspects of their work 
and that they should now devote time to other things. 
Third, the standardized practice tests enable a child to 
learn the tool subjects so much more quickly and thor- 
oughly that more and better work is consequently made 
possible in the content and appreciation subjects. Early 
mastery of the Three R's allows the teacher to devote 
more and not less time to art and music and literature, 
and equip the pupil to do better work than before in 
geography and history and higher arithmetic, where these 

44 



TESTS OF AMOUNT LEARNED 

tools will be of the greatest assistance. This is an effect 
that lasts through the upper grades and through high 
school and college. Fourth, the emphasis which tests give 
to the mechanical aspects of school work is at worst only 
temporary. Tests for the less tangible but perhaps more 
important factors in education are being rapidly developed. 

Are the tests really constructed in a scientific manner? 
Every well-standardized educational scale represents the 
work of a specialist in a given field for months and some- 
times for years. It is made up of exercises or problems 
which have been selected with the greatest skill and care 
to perform the difficult task of testing one particular ability 
of the pupil and only that one; which have been sub- 
mitted to thousands of children to determine their real 
difiiculty; which are prefaced by clear and brief directions 
found by trial to be easily comprehended by children; 
and which can be scored quickly and objectively. These 
scales are proved by ingenious experiment actually to 
measure what they purport to measure, and to measure 
it reliably or consistently. Before they are offered to 
others, they have been submitted to more tests than most 
of us would ever think of. Such scales are commonly 
placed on the market and sold for little more than printing 
costs ; so the effort that has been put into their construc- 
tion ordinarily receives no direct money return whatever. 

For the teacher to take advantage of the results of such 
work and to utiHze standardized tests in her classroom 
seems only common sense. 

The Tests in Arithmetic 

The Courtis Standard Research Tests, Series B, are 
among the oldest and most widely used of the arithmetic 
scales. They are probably used with more than a million 
school children each year. They consist of four tests, one 
each in addition, subtraction, multiplication, and division, 
of integers. Each test is printed on a separate page, and 

45 



STANDARDIZED TESTS 

the time allowances are, respectively, 8, 4, 6, and 8 minutes. 
The pupils are told that they are not expected to finish all 
the examples, but they are asked to work as rapidly and 
accurately as possible. There are two scores : the number 
of examples attempted, as a measurement of rate; and 
the number right, as a measurement of accuracy. The ad- 
dition test consists of 24 examples, each 3 figures in width 
and 9 in height; the subtraction test consists also of 24 
examples, each of these being 8 or 9 figures in width; 
the multiplication test contains 25 examples having 2 
figures in the multiplier and 4 in the multipHcand; and 
the division test contains 24 examples having 2 figures in 
the divisor and 4 or 5 in the dividend. 

In such a test there is obviously no opportunity for 
diagnosis; it is valuable mainly for comparison. It may be 
used for this purpose in any grade from the fourth 
through the eighth. The standards for each grade are 
based on the number of examples which pupils of that 
grade in different cities have been found able to solve. 

The Woody Arithmetic Scale (for grades 2-8), also 
one of the earlier scales, is, on the other hand, of the 
diagnostic type. It consists of four scales printed on 
separate sheets, one for each of the fundamental opera- 
tions. It goes beyond integers to include a few examples 
in fractions. Instead of having all the examples in one 
group to be of equal difficulty, as in the Courtis scale, it 
arranges the examples on each sheet in the order of their 
difficulty. Speed is, then, not measured; the pupil is 
allowed to make his way as far down the scale as he is 
able to go. The Woody scale is diagnostic in the sense 
that it includes in the scale for each operation examples 
of a great many kinds. By looking over the sheet the 
teacher can consequently tell, for instance, which variety 
of addition problems the pupil can work and which he 
cannot. But the examples are not arranged according to 
type — only according to difficulty — and the ability tested 

46 



TESTS OF AMOUNT LEARNED 

by each example must be decided on by the teacher herself. 
Furthermore, there is, in certain editions of the scale, 
only one example to represent each type. One example is 
not enough to test a pupil's ability to work problems of 
a certain type. 

The Cleveland Survey Arithmetic Tests (grades 3-8), 
which were designed at the time of the survey of the 
schools of Cleveland, Ohio, to give a more detailed ap- 
praisal of pupil ability than could be secured by the 
Courtis tests, consist of 15 tests of gradually increasing 
complexity, each test made up of a large number of 
examples of the same kind. Thus we have an attempt 
to make an analysis of arithmetic problems into different 
types, and a thorough test of the pupil's ability to deal 
with each type. The problems of the tests are arranged 
on a spiral principle — each operation recurring several 
times in a different and more difficult form. Thus, addi- 
tion first appears as "Set A," which consists of 65 ex- 
amples, each made up of two figures to be added together; 
it reappears as "Set E," in which there are 16 examples, 
each of five figures arranged in single column; then as 
"Set J," in which there are 14 examples of thirteen figures 
each, each example still a single column; and as "Set M," 
in which there are 12 examples, each five figures high and 
four columns wide. Thus it is seen that for addition 
there is tested successively knowledge of the tables, ability 
to add to a partial sum, attention span or the length of 
time the child can keep up this process and, finally, the 
ability to carry from column to column. Tests to cover 
different kinds of examples in subtraction, multiplication 
and division are arranged on the same spiral principle. 
In this way the teacher can discover which varieties of 
examples a pupil can and cannot work, and is able to give 
him help accordingly. A convenient feature is that a 
pupil's strong and weak points can be seen from a study 
of the scores arranged together on the cover of the book- 

47 



STANDARDIZED TESTS 

let, without leafing through the different pages. This test 
is diagnostic in a very helpful way. It can also be used 
for comparison, as it has standards based on surveys of 
several large cities. 

The Monroe Diagnostic Tests in Arithmetic similarly 
present in a spiral form groups of problems of various 
types. They are different in that they are more complete 
and are printed in four separate leaflets. The first leaflet, 
Test I, is a series of very simple examples in addition, 
subtraction, multiplication, and division of integers, for use 
in the lower grades; Test II is more difficult examples of 
these operations with integers ; Test III is a series of five 
groups of examples involving different "cases" in the 
four fundamental operations with common fractions ; 
Test IV similarly involves the simple operations with deci- 
mal fractions. 

These four measuring instruments in arithmetic show 
the process of development in this field. The Courtis test, 
one of the earliest, was a blanket test allowing no 
diagnosis. The early Woody test was a diagnostic test, 
but it did not attempt to analyze arithmetical abilities into 
types, nor to group problems along such lines. The newer 
Cleveland Survey test has succeeded in making such an 
analysis and grouping. The still newer Monroe test has 
retained this grouping, has extended the scale to include 
thorough tests of common and decimal fractions, and has 
broken the scale up into parts, so that one needs to make 
an expenditure only for the part directly adapted to the 
age of one's pupils. 

These four may be regarded as typical of the "re- 
search" tests in arithmetic, whose function is to discover 
how well the pupils can perform the fundamental opera- 
tions. In addition to these, there are the "practice" tests, 
whose function is to assist pupils to perfect themselves in 
these operations. Among the latter are the Thompson 
Minimum Essentials in Arithmetic (1908), the Studebaker 

48 



TESTS OF AMOUNT LEARNED 

Economy Practice Exercises in Arithmetic (1916), and the 
Wildeman Practice Tests in Fractions (1922). The 
Courtis material consists of a series of about forty prac- 
tice lessons printed on 5x8-inch stiff cards, arranged in 
order of complexity. Each pupil is provided with a tablet 
of the same size, made up of sheets of transparent paper. 
When the pupil inserts a card beneath a sheet of this thin 
paper, the figures on the card show through, and the 
pupil' writes his answers on the tissue paper. Thus the 
card itself is preserved intact. On the back of the card 
the answers are printed, so that when the inverted card 
is placed in the tablet the answers appear just below the 
pupils' answers, and in this manner the practice work 
can be corrected by the pupil himself. The pupil thus pro- 
ceeds from one card to another as rapidly as he masters 
the given type of problem. At intervals test cards are 
inserted — cards without answers on the back. In case of 
these the tissue sheet is torn out and corrected by the 
teacher, who lays it over an answer sheet in her Manual 
of Directions. Each child proceeds at his own rate and 
is always working on the kind of example which he needs 
next to master. He keeps his own record of progress 
upon graph sheets found in his tablet, and thus his prog- 
ress and his objective are always obvious and definite. 

The principle involved in such practice exercises is 
mastery of the educational skills by individual work. It 
means abandoning the idea that skills can be taught by 
mass instruction. It means a partial return to the in- 
dividual teaching prevailing in the days of our grand- 
parents, but a return after all to a quite different level. 
For the newer individual teaching utilizes skillfully graded 
practice material. It not only allows each pupil to proceed 
at his own rate, but assists him by these graded exercises 
to proceed systematically and therefore to master the essen- 
tials rapidly and thoroughly. 

49 



STANDARDIZED TESTS 

The Courtis exercises, taken above as illustrative, have 
brought steady improvement in the fundamentals in several 
different cities. They have proved that allowing each pupil 
to practice daily on the kind of problem which he (and 
perhaps no one else in the class) has reached in his devel- 
opment on that particular day, does result in rapid mastery 
of the basic skills in arithmetic. Others of the practice 
tests, however, are sometimes considered less difficult to 
administer. Thus, the Plymouth and the Studebaker prac- 
tice exercises avoid the necessity of using thin paper 
(which under small sweaty hands exhibits a regrettable 
tendency to curl up and become unmanageable) by printing 
the examples on cards which have cut-out portions through 
which the answers can be written on ordinary paper. The 
Wildeman exercises use still another scheme, printing the 
practice problems in a graded series in a small leaflet, 
which the pupil utilizes by laying his paper just below the 
problem he is to solve. The paper can then be folded 
down and another answer written, and so on indefinitely. 
A plan very similar to this has proved successful in 
schemes for individual teaching worked out in the schools 
of Winnetka, Illinois. 

The teacher should by all means secure descriptive 
literature and samples of these practice tests, and of others 
now available in handwriting and reading, and should plan 
to change over her teaching of the Three R's to an in- 
dividual basis as rapidly as possible. In this connection, 
she should read the accounts of the experiments and 
results in the Winnetka schools, published in the Ele- 
mentary School Journal, September, 1920, and in the 
Journal of Educational Research, March, 1922. All pro- 
motions in that city for two years have been based on 
individual work. Each pupil is allowed to go forward 
at the rate best suited to his ability. But, of course, prac- 
tice tests will also serve as teaching devices in schools not 
using a scheme of individual promotions. 

50 



TESTS OF AMOUNT LEARNED 

Turning now from the tests on the fundamental 
operations to the tests involving reasoning, we find prob- 
lems of the type which are written out in words, such as, 
''How many pencils can you buy for 50 cents at the rate 
of 2 for 5 cents?" In these problems the operation to 
be performed is not indicated, and the ability tested is just 
the ability to decide which operation to perform. The 
oldest of these tests is the Stone Reasoning Test in Arith- 
metic, first published in 1908. It consists of twelve prob- 
lems about school happenings or other interesting affairs, 
arranged in order of increasing diiBculty. It has been 
used in several school surveys, and therefore has satis- 
factory standards. It does not attempt to classify arith- 
metical problems into types. 

The Monroe Standardized Reasoning Tests in Arith- 
metic (1918) have made this classification. Just as the 
Cleveland Survey tests went beyond the Woody by at- 
tempting to analyze the ability to add or subtract into 
several distinct abilities and to devise a separate test for 
each, so the Monroe tests go beyond the Stone by attempt- 
ing to analyze reasoning problems into types and to test 
these separately. The Monroe scale consists of three tests, 
issued separately. Test I, for the fourth and fifth grades, 
is built around simple operations with integers ; Test II, 
for the sixth and seventh grades, involves fractions; and 
Test III, for the eighth grade, involves percentage. The 
problems use the form of language statement found to 
occur most commonly in eight widely-used textbooks. A 
pupil may be given three scores : one for rate, one for 
use of the correct principle in his solution, and one for 
getting the correct answer. 

The Buckingham Scale for Problems in Arithmetic 
(1919) is quite similar in type. The first division is for 
grades 3 and 4, the second for grades 5 and 6, and the 
third for grades 7 and 8. The problems are arranged 
in order of statistically-determined difficulty, and the 

51 



STANDARDIZED TESTS 

pupil's score is the value of the last problem he succeeds 
in solving. No time limit is used. 

To find whether her pupils are up to standard in 
arithmetic, the teacher should give one of the research 
tests on the fundamental operation and one of the reason- 
ing tests. If her pupils are below standard in the oper- 
ations, she should give them practice in this work about 
ten minutes a day with one of the sets of practice tests. 
Enabling the pupil to have definite objectives and to see 
his daily accomplishment in comparison with those of 
other pupils of his own age, is found wonderfully effective 
in securing improvement. Psychologists state that the most 
influential single factor in learning is the purpose to learn. 
Mere repetition of an exercise — unmotivated drill work — 
does not bring mastery of that exercise. But when a 
pupil tries hard because of a definite purpose, his learning 
is rapid. Any adult can verify this from his own experi- 
ence in learning typewriting or piano fingering or golf or 
a foreign language. One-half hour of work under con- 
centrated attention is worth three hours of work of a 
dilatory kind. People fix their attention on their work 
when they have a definite purpose in mind. One of the 
best ways to assist pupils to definite purposes is to give 
them definite daily objectives. The practice tests are so 
arranged as to do this. 

If her class is shown by the research tests to be up 
to standard in the fundamentals and below standard in 
the reasoning problems, the teacher should not spend time 
on the practice tests. She can then with confidence omit 
repetitional work with the operations and can devote extra 
time to instruction in reasoning problems, until a second 
test shows that the class has attained standard proficiency 
in those also. 

The Tests in Reading 

Silent reading ability is the ability to get the thought 

52 



TESTS OF AMOUNT LEARNED 

from what one is reading, while oral reading ability is the 
ability to transmit this thought to others. The two are 
tested separately. 

The simplest of the silent reading tests are the vocabu- 
lary tests. These consist of lists of words whose meaning 
the child is to indicate. In the Thorndike Visual Vocabu- 
lary Scales (1914) the words are arranged in lines of ten 
words each, all the words in one line being of equal 
difficulty. The child is directed to write the letter F under 
every word that means a flower, the letter A under every 
word that means an animal, etc. The words get harder 
as one works down the page, and the pupil's score is the 
score value of the last line in which he gets at least eight 
of the ten words correct. Since all the answers are either 
right or wrong, this little test is quite objective and is 
very easy to give and score. It will be useful in showing 
the teacher the extent of the vocabulary possessed by a 
class she is taking charge of, or in showing the source 
of certain pupils' difficulty in reading — whether the cause 
is failure to understand the meaning of the words, or some 
other deficiency. 

A second type of silent reading test measures the 
pupil's ability to understand reading material expressed in 
sentences and paragraphs. One of the earhest and most 
widely used is the Thorndike Scale Alpha (1916). The 
scale consists of a series of paragraphs, each followed by 
questions to be answered by the pupil. The paragraphs are 
arranged in order of difficulty, and the range is such 
that the lower part of the scale can be understood by 
very young children, while the upper part requires rather 
close attention even from an adult. No time limit is used 
and the child is allowed to go as far as he can. Sample 
paragraphs follow : 

Set II. Difficulty 5.25 

Read this and then write the answers. Read it again if you 
need to. 

53 



STANDARDIZED TESTS 

Long after the sun had set, Tom was still waiting for Jim 
and Dick to come. "If they do not come before nine o'clock," he 
said to himself, "I will go on to Boston alone." At half past 
eight they came, bringing two other boys with them. Tom was 
very glad to see them and gave each of them one of the apples 
he had left. They ate these and he ate one too. Then all went 
on down the road. 

1. When did Dick and Jim come? 

2. What did they do after eating the apples? 

3. Who else came besides Jim and Dick?. 



4. How long did Tom say he would wait for them? 

Set IV. Difficulty 7 

Read this and then write the answers to i, 2, 3, and 4. Read 
it again if you need to. 

You need a coal range in winter for kitchen warmth and for 
continuous hot-water supply, but in summer when you want a 
cool kitchen and less hot water, a gas range is better. The XYZ 
ovens are safe. In the end-ovens there is an extra set of burners 
for broiling. 

1. What effect has the use of a gas range instead of a coal 
range upon the temperature of the kitchen? 

2. For what purpose is the extra set of burners? 

3. In what part of the stove are they situated? 



4. During what part of the year is a gas range preferable? 

It will be seen that some of these answers are to be 
written out in sentences. This requirement makes the 
scoring rather cumbersome, or else somewhat unreliable. 
To make it possible for everyone to score the papers with 
the same results, the author supplies a list of possible right 
and wrong answers. But to take the trouble to look these 
up is time-consuming, whereas to rely only on judgment 
as to whether an answer is correct or not is to get results 
that are not thoroughly objective. This scale has been 
widely used, however, and its standards are based on a 
large number of scores. 

Another group of tests which have been widely used 
are the Kansas Silent Reading Tests (1916). Sample 
paragraphs follow: 

54 



TESTS OF AMOUNT LEARNED 

The air near the ceiling of a room is warm, 

while that on the floor is cold. Two boys are in the 

No 1 room, James on the floor and Harry on a box eight 

Value 1.0 feet high. Which boy has the warmer place? . 

A list of words is given below. One of them is 
needed to complete the thought in the following 
sentence: The roads became muddy when the snow 

No 14 Do not put the missing word on the blank 

Value 4 9 space left in the sentence, but put a cross below the 
word in the list that is next above the word needed 



in the sentence. 



water 
melted 



These tests have a time limit of five mimites, and the 
rate score depends on the number of paragraphs about 
which answers were made out in that time. The exercises 
in the test are very interesting for the pupils, and are 
very easy for the teacher to score. Some of the exercises, 
however, are very much like puzzles and others are like 
arithmetical problems, both of which require thought about 
what is read, rather than comprehension of what is read. 
The weakness of the scale, then, is that it does not always 
measure what it purports to measure. But the scoring is 
thoroughly objective and the standards are adequate. 

A set of tests very similar to the Kansas scale, but free 
from some of its weaknesses, is the Monroe Standardized 
Silent Reading Paragraphs. These attempt to limit the 
question to what would be grasped by understanding the 
paragraph, rather than hy adding one's thought to it. 
They have a time limit, and measure rate and comprehen- 
sion separately. The rate score is the sum of the rate 
values given each paragraph, and the comprehension score 
the sum of the comprehension values of each paragraph. 
Test 1 is for grades 3, 4 and 5; Test 2, for grades 6, 
7 and 8; Test 3, for grades 9, 10, 11 and 12. The para- 
graphs are arranged in order of difficulty, so that everyone 

55 



STANDARDIZED TESTS 

is able to make a score. Here are sample paragraphs from 
Test 1: 

No. 2 
The little Pilgrim girls carried their 
Rate work boxes to the dame-schools and Compre- 

Value 7 learned to sew and knit as well as to hension 

read and write. Value 1.3 

Where did the girls go with their work 
boxes ? To the 

No. 4 
Hiawatha was a little Indian boy. He 
had no father and no mother. He lived 
Rate with his grandmother, Nokomis. His Compre- 

Value9 home was in a wigwam. Draw a line hension 

under the word that tells whom Hiawa- Value 1.4 
tha lived with. 

Father, aunt, mother, uncle, sister, 
grandmother. 

Another of the newer tests is the Courtis Silent Read- 
ing Test No. 2. This consists of a story of about five 
hundred words which is first presented as a whole. It 
secures the rate measurement by having the children read 
through the story for three minutes, while at the end of 
every thirty seconds, when the teacher says "Mark," each 
pupil draws a line around the word which he is reading 
at that time. The comprehension measurement is secured 
by presenting the story a second time broken up into para- 
graphs with questions after each paragraph. In order to 
make the scoring completely objective, all the questions are 
of a form that can be answered by yes or no. The effect 
of guessing, the pupil having an even chance to guess the 
answer correctly, is cut down by subtracting the number 
of wrong answers from the number right. The answers to 
the questions do not require the pupil to go beyond the 
material in the story. The test is adapted to grades 2-6. 
Here is a sample paragraph from the complete story called 
"The Kitten Who Played May-Queen": 

When the day of the party came, Daddy planted a May-pole 
and Mother tied it with gay-colored ribbons. There were to be 

56 



TESTS OF AMOUNT LEARNED 

games and dances on the grass and a delicious supper, with a 
basket full of flowers for every child. 

1. Were the children to have anything to eat? 

2. Were they going to play on the grass? 

3. Were they going into the house to dance? 

4. Were the baskets to be full of flowers?. 



5. Was it Daddy who tied the ribbons to the pole? 

The Burgess Pictorial Supplement Scale (1921) is 
built on quite a different plan. Mrs. Burgess criticizes 
the previous tests in reading as in some cases measuring 
other abilities besides reading ability; in others, being made 
up of such a variety of exercises as to make it difficult 
to interpret the results; and in others, being hard to give 
and score. The Picture Supplement Scale is made up of 
exercises all of one type. A drawing is shown, and just 
below is a paragraph asking that something be done to 
the picture. The thing to be done is very simple, but 
the child cannot do it unless he understands the directions 
as they are given in the printed form. In connection with 
the sample paragraphs given below, one must use his 
imagination to supply the drawing that in each case ap- 
pears just above the paragraph. 

1. This naughty dog likes to steal bones. When he steals 
one he hides it where no other dog can find it. He has just 
stolen two bones, and you must take your pencil and make two 
short, straight lines to show where they are lying on the ground 
near the dog. Draw them as distinctly as you can and then go 
on. 

2. This man is an Eskimo who lives in the far north where 
it is cold. There has just been a big storm, and all the ground 
is white with snow. The man has been walking and has made 
many footprints in it. With your pencil quickly make four of 
them in the snow just behind him. 

19. When the road is rough the porter finds it hard to push 
this wheel chair. Draw a line to show where the road is. Be 
sure to make the line in front of the chair smooth so that the 
chair will roll along easily, but make the line in back of it uneven 
because up to this time the path has been rough. 

In this test there are no puzzles or catches, and the 

57 



STANDARDIZED TESTS 

ivording is simple throughout. The vocabulary is taken 
from the commonest words in the English language, as 
revealed in previous studies of correspondence and news- 
paper articles. The number of ideas in one paragraph has 
been reduced as far as possible and organized around one 
central idea. All the paragraphs are carefully constructed 
to be alike in each of these characteristics. All factors 
which would modify the score without being strictly a 
part of reading ability have been supposedly ruled out — 
such factors as demand for special imagination, or for 
abiHty to remember or to reason. Others, which could not 
be eliminated, have been held constant — difficulty of action 
demanded, vocabulary difficulty, sentence structure, uni- 
formity of print, interesting character of paragraph, etc. — 
so that only one factor is to vary and be measured, namely, 
the amount a child can read and understand in a given 
time. Although six different scales for measuring silent 
reading were completed, printed, and tried out in 23 
school systems before this scale was perfected, the Pic- 
ture Supplement Scale was finally retained as that best 
meeting these requirements. Some of the other scales 
were of the type in which the exercises appear in order 
of increasing difficulty and the child is given no time 
limit, but is allowed to go as far as he can. These were 
finally rejected, for the author believes that rate of reading 
is an important factor in reading ability. Of several 
typists who are equally accurate, the one who turns out 
the most pages per hour is the best worker, and of several 
newspaper reporters who write equally accurate and inter- 
esting stories, the one who gets his copy to the editor's 
desk with the greater speed is considered best. Similarly, 
the child who by endless rereading and rechecking can get 
a piece of work right is not the best student. A time 
limit of five minutes is therefore used in these tests. 

Summary. In the reading tests here described we see 
a development from mere word lists and from rather 

58 



TESTS OF AMOUNT LEARNED 

cumbersome and semi-subjective tests of connected dis- 
course toward tests which measure the complex reading 
ability in a simple and objective manner and without a 
marked modification of the score by other abilities. Com- 
prehension has been measured, in these and other reading 
tests, by asking the meaning of separate words, by asking 
a reproduction of the story (Starch test), by asking ques- 
tions about the story, by a combination of these two 
(Gray test), and by asking the pupil to carry out certain 
directions. Rate (sometimes disregarded) has been meas- 
ured by counting the number of exercises completed in a 
limited time, by counting the words in the exercises com- 
pleted, and by counting the number of words read every 
thirty seconds when no pauses were made for writing an- 
swers. By availing herself of the results of thousands of 
hours of intensive work by experts, the teacher is now able 
to get a very accurate measurement of the rate and the 
completeness of her pupils' comprehension of what they 
read. And silent reading is probably the most important 
study in the curriculum, because it is the key to almost all 
the other studies. 

Oral reading can now also be measured. The simplest 
tests here, as in silent reading, are word lists ; but in this 
case, of course, the ability tested is not the ability to give 
the meaning of the word, but to pronounce it. In the 
Haggerty Visual Vocabulary Tests, for example, the words 
are arranged in groups of uniform difficulty of pronuncia- 
tion, and the child's score is the value of the last group 
in which he can pronounce correctly 80 per cent of the 
words. 

A more complete measurement is secured by using the 
Gray Oral Reading Test, which consists of about a dozen 
successive paragraphs in which the words grow harder and 
harder to pronounce. The lower end of the scale can 
therefore be used with very young children, while the 
upper end of the scale is sufficiently difficult to give pause 

59 



STANDARDIZED TESTS 

even to a mature pupil. This can be seen from the two 
following widely separated selections. The type used for 
young children is of course very much larger than that 
here shown. 

I. A boy had a dog. 

The dog ran into the woods. 

The boy ran after the dog. 

He wanted the dog to go home. 

But the dog would not go home. 

The little boy said, "I cannot go home without my dog." 

Then the boy began to cry. 

II. The hypotheses concerning physical phenomena formulated 
by the early philosophers proved to be inconsistent and in general 
not universally applicable. Before relatively accurate principles 
could be established, physicists, mathematicians, and statisticians 
had to combine forces and work arduously. 

While the pupil reads a paragraph aloud, the teacher 
marks on her own copy, by a definite system of scoring, 
errors of six types: gross errors, minor errors, omissions, 
substitutions, insertions, and repetitions. The directions 
furnished with the test define and illustrate these errors 
and the method of indicating them and deducting for them 
in very specific terms, so that a number of teachers scoring 
the same pupil would give him the same score. This test 
has been given to a large enough number of children to 
furnish quite reliable standards by school grades. It is 
necessarily an individual test, which makes its use rather 
laborious in comparison with the group tests we have been 
describing. For the measurement of the reading ability of 
selected individual pupils whose oral reading needs special 
study, however, it will be found very valuable. 

The Tests in Spelling 

The best-known scale for measuring ability to spell is 
the Ayres Spelling Scale, which is made up of the one 
thousand commonest words in the English language ar- 
ranged in twenty-six columns, each column made up of 
words found by many thousand children about equally 

60 



TESTS OF AMOUNT LEARNED 

difficult to spell. Which words of the English language 
occur most frequently was discovered by studying letters, 
newspapers, and standard literature, in an amount aggregat- 
ing nearly 400,000 words. 

At the top of each column is shown the per cent of 
correct spellings to be expected in each grade. These are 
based on the spellings of about 70,000 children. The scale, 
on account of its content, has been used so extensively 
as curriculum material in spelling that its value as a meas- 
uring device may presently disappear, for the early median 
scores will not continue to represent average attainment. 
But when this happens it will mean that the scale has 
been effective in insuring in the average American child 
the ability to spell correctly the words he will most com- 
monly need to know how to spell. 

One trouble the teacher finds in using this scale is that 
there are hardly enough words in a column (words of 
equal difficulty) to serve to make up a test. To meet this 
situation we now have the Buckingham Extension of the 
Ayres Scales, in which the original list is enlarged to 
about 1,500 words. The new words were selected, how- 
ever, not on the basis of their frequency of actual use 
but on the basis of the frequency of their occurrence in 
standard spelling books. The Iowa Spelling Scale, which 
was built up, in a manner similar to that used by Ayres, 
from the vocabulary found in the correspondence of the 
citizens of Iowa, contains nearly 3,000 words. 

Presenting words for spelling by pronouncing them one 
at a time is not, of course, presenting them in a "natural 
situation." The attention of the pupil is fixed on the word 
itself, while in actual writing his attention is fixed on 
the meaning he is trying to express or on the relation of 
the words to each other, and the spelling of the word is 
in the margin of one's attention. To test spelling in a 
situation more nearly resembling that in which the ability 
to spell will actually function, Monroe has devised a 

61 



STANDARDIZED TESTS 

Timed-Sentence Spelling Test. In this the words to be 
spelled are embedded in sentences, which are dictated at 
a regular rate for copying. None of the test words are 
placed near the end of the sentences, and in order to 
allow for differences in rate of writing the pupils are 
told that if they have not finished a sentence before an- 
other is dictated they are to leave it and begin the new 
sentence. Since this is one of the newer devices, this test 
does not have standards as good as those of the Ayres 
scale. For use in experiments with devices for teaching 
spelling, however, where it is only comparison of earlier 
with later achievements that is needed, the lack of stand- 
ards is not serious. 

Tests in Punctuation and Grammar 

The punctuation scales are made up of sentences 
printed without punctuation marks. The punctuation is to 
be inserted by the pupil. Such tests are used quite extens- 
ively in business schools and in business firms, as well as 
in the public elementary school. They are very simple in 
structure, and are easily adapted to teaching as well as 
to testing. 

The grammar tests present sentences which are gram- 
matically incorrect and are to be corrected by the pupil. 
The Qiarters test, for example, has separate scales for 
pronouns and for verbs. Like the punctuation scales, 
these are not difficult to understand or to use. 

The Composition Scales 

The earlier scales for measuring merit in English com- 
position were blanket scales, measuring all types of such 
ability together. The first, the Hillegas Scale, consisted 
of ten brief sample compositions arranged in order of 
merit, without any description of the good and bad points 
supposedly present in each sample, and without distinction 
between the various forms of writing. The Harvard- 

62 



TESTS OF AMOUNT LEARNED 

Newton Scales, published later, contains four separate 
scales, one each for description, exposition, narration, and 
argumentation. A short analysis is printed below each 
selection, showing its points of strength and weakness. 
The Willing Scale, which has been used in a survey of 
the schools of Denver and other cities, consists of eight 
brief compositions on the theme: An Exciting Experi- 
ence. The compositions are rated separately for ''story 
value" and for ''form value." The Lewis Scales for 
Special Types of English Composition measure ability in 
letter writing. There are scales for judging order letters, 
letters of application, social letters of the narrative type, 
and social letters of the problematic type. Letter writing 
is a form of composition for which scales are especially 
useful. 

In using a modern composition scale, the teacher se- 
cures from her class compositions about one of a list of 
suggested subjects, written under carefully described con- 
ditions; and she then compares each of these with the 
standard samples shown on the scale. The teacher should 
remember that to estimate a composition accurately by 
comparing it with the scale requires a good deal of skill. 
The composition scales, as contrasted for example with 
the arithmetic or spelhng tests, are for this reason com- 
paratively hard to use, and measurement by them should 
not be attempted without a good deal of preliminary train- 
ing. Such training can be secured by practice in rating 
compositions which have already been rated by an expert. 
A collection of such exercises for practice has been pub- 
lished by Thorndike. It is called "English Composition, 
150 Specimens Arranged for Use in Psychological and 
Educational Experiment," and can be procured from 
Teachers College, Columbia University, New York. 

The Handwriting Scales 
Handwriting is like English composition in that it is 

63 



STANDARDIZED TESTS 

measured by comparing samples of pupil work with stand- 
ard samples arranged in order of merit on a scale. The 
handwriting scales are of two types: those for comparison 
and those for diagnosis. 

The Ay res scale, one of those most commonly used 
for comparing one group of pupils with another or with 
standard achievement, is made up of examples of pen- 
manship which are arranged in order of legibility only. 
Legibility in these selections was determined by timed read- 
ings, and the order is therefore completely objective. In 
the "Gettysburg Edition" of this scale the wording of all 
the samples is the same, being the first few sentences of 
Lincoln's Gettysburg Address. When the pupils to be 
tested also write from this address, comparison of their 
work with the scale is made more accurate. Standards for 
this scale are very dependable. 

The Thorndike scale, which is also extensively used 
for comparison, is made up of fourteen examples of pen- 
manship arranged according to a composite of three cri- 
teria: beauty, legibility^ and general merit. Their order 
was determined not objectively but by the consensus of 
opinion of a considerable number of handwriting experts, 
teachers and supervisors of handwriting. Conversion 
tables have been worked out which show the value of each 
of the fourteen Thorndike samples in terms of the eight 
Ayres samples, or vice versa. By this means one using 
either of these scales is able to avail himself of the 
standards of both. 

The Freeman scale, on the other hand, is designed for 
diagnosis. It really consists of five scales printed on one 
sheet. These five measure, respectively, uniformity of 
slant, uniformity of alignment, quality of line, letter form- 
ation, and spacing. A pupil whose score is low on the 
Ayres or Thorndike scales, or whose writing is otherwise 
known to be poor for his grade, can be 'diagnosed' by 

64 



TESTS OF AMOUNT LEARNED 

means of this scale and the exact defect in his writing 
discovered. 

The Gray Score Card, modeled perhaps after score 
cards for judging livestock, is simply a list of the prin- 
cipal characteristics of handwriting, such as alignment, 
size, slant, etc., with a value opposite each from which 
deductions can be made. Such a card when filled out 
shows the strong and weak points of the pupil's writing, 
and indicates to pupil and parent the points in which he 
needs to improve. By making a 'diagnosis' of a pupil's 
writing as a physician does of his physical condition, the 
teacher can direct the pupil to concentrate his practice on 
certain points of technique and, by referring to the diag- 
nostic card later on, can tell whether the practice has 
been effective. The purpose of the card is the same as 
that of the Freeman scale, the difference being that the 
Freeman scale presents samples of writing with which the 
pupil's work can be compared while the card gives only a 
list of the names of the qualities measured. 

In giving tests in handwriting the teacher cannot se- 
cure a measurement of rate by having the pupils copy 
material from print, for in that case the rate of writing 
is confused with the rate of reading; nor by having the 
pupils write from dictation, for then the rate of writing is 
governed by the rate of dictating. The children must 
write a few lines from memory. Whatever selection is 
taken for this purpose — whether it be Mother Goose 
Rhymes or the Gettysburg Address — should be reviewed 
until it is freshly in mind before the test begins. The pupil 
should then write at his regular rate for the period of the 
test. A key copy of the selection can be prepared by the 
teacher for help in scoring the results, annotated to show 
number of words occurring up to any given point. The 
number of words written by each pupil in the time allowed 
can then be seen by a glance at this key. This gives the 
score for rate. The score for quality is secured by finding 

65 



STANDARDIZED TESTS 

the standard sample which is most Hke the writing of the 
given pupil. 

The handwriting scales, like the composition scales, 
will not yield reliable scores until the tester has had con- 
siderable practice in making the comparisons involved. 
Such practice can be facilitated, as in the case of English 
composition, by securing a large number of writing sam- 
ples the merit of which has been determined and recorded 
opposite the number of the sample in a key list. Such a 
group of rated samples has been issued by Thorndike in 
a booklet entitled, ''Teachers' Estimates of the Quality of 
Specimens of Handwriting," procurable from Columbia 
University. 

Other Achievement Tests 

The school subjects discussed above are those in which 
tests and scales are now best developed. But there are 
also standardized tests in history, geography, drawing, 
music, journalism, physical training, manual training, home 
economics, commercial subjects, algebra, geometry, general 
mathematics, general science, physics, chemistry, biology, 
Latin, German, French and Spanish. Samples of these 
tests may be secured from the publisher named in the 
Bibliography referred to on page 36. 

"Home-Made"' Objective Tests 

A teacher who wishes to secure a more accurate meas- 
urement of her pupils, not so much for the purpose of 
comparing them with pupils elsewhere as for the purpose 
of comparing them with each other or with her previous 
classes, can do so without depending on the published 
standard tests. She can make a test of her own. To do 
this she should construct a series of simple and unambigu- 
ous statements about the material recently covered in class. 
The following examples, taken from courses in Education, 
will show what is meant: 

66 



TESTS OF AMOUNT LEARNED 

True-False Test 

Directions: Place before each statement the word true or the 
word false. 

11. The socialized recitation is better adapted to dull 

pupils than to bright ones. 

19. Scientific bases for the curriculum have been more 

carefully worked out in spelling than in geography. 

Best-Reasons Test 

Directions: In each case, place a cross before the best 
reason. 

1. If the project method is better than the older methods of 
teaching, that is because : 

It develops the pupil's originality. 

It is being advocated by most colleges of education. 

It can be more quickly mastered than the older methods. 

It makes sure that all pupils cover the same ground. 

2. The reasoning tests in arithmetic are not so widely used 
as the tests on the fundamental operations because: 

It takes more time to give the reasoning tests. 

Reasoning abiHty is comparatively hard to measure. 

The designers of the operations tests are men of more 

prestige. 
The reasoning tests have only recently been devised. 

Multiple-Answer Test 

Directions: Indicate the correct answer by underlining one 
word or phrase in each parenthesis. 

3. The first scientific educational scales were worked out in 
(history, geography, penmanship, geometry). 

10. The most reliable test for the measurement of the intel- 
ligence of children is the (Voelker, Liao, Binet, Thurstone). 

This kind of examination, when used at the end of 
a semester, has many advantages over the traditional ex- 
amination. For one thing, it covers a great deal more 
ground. A class can easily mark from fifty to seventy- 
five of these statements and have time left to score the 
papers during the same class-hour. The large number of 
statements means a more thorough review by the pupil, 
less luck in grades from getting questions on just the 
material that happened to be reviewed, and a more thor- 
ough test by the teacher. Again, the score is independent 
of the teacher's subjective standards of scholarship. The 

67 



STANDARDIZED TESTS 

score depends only on the number of right and wrong 
answers, and will be the same no matter who grades the 
papers. This means a fairer rating of the pupils. It also 
means a better relationship between teacher and pupil, for 
the teacher instead of being a judge, a person standing 
perhaps between the student and his diploma, can take che 
role of a guide and helper who assists the pupil to secure 
that mastery of subject-matter which will be indicated by 
a high score on this objective and impartial examination. 
Again, such an examination (when once constructed) is 
time-economy and spares the teacher the worst drudgery 
in teaching — the reading of a large number of answers 
monotonously alike. This type of examination also saves 
the pupil from the drudgery of writing out long answers. 
In the author's classes, and in others which have reported 
using the test, the students after trying it have always 
voted almost unanimously in favor of this kind of an 
examination. Furthermore, the test can be used as an 
excellent teaching device, for if it is given so as to allow 
another hour for discussion of the statements, it will serve 
as a very good outline of the course to date; and the facts 
in it are likely to be remembered if they are talked over 
after being presented to the concentrated attention char- 
acteristic of the examination hour. Another advantage of 
such a test is that if the test papers are always collected 
after the discussion, alternate forms of the test can be 
used year after year and standards of attainment can be 
gradually built up. This allows one to grade a given 
pupil by comparing him not only with the other members 
of his class, but with all previous classes which have taken 
the test. 

Certain disadvantages in the use of such an examina- 
tion will, to be sure, be found. There is, of course, a 
chance to get the answers correct, especially on the true- 
false type, by lucky guessing. This is cut down by the 
scoring scheme used for the true-false type; namely, sub- 

68 



TESTS OF AMOUNT LEARNED 

tracting the number of wrong answers from the number 
right. This ehminates guessing, in case there is a large 
number of statements, as can be seen from the following 
formulae. If x is the number of statements for which 
the pupil knows the answer; v, the number he guesses 
right; and b, the number he guesses wrong; then the total 
number of right answers is x plus y, and the final score is 
X plus 3; minus 2. But since he has an equal chance to 
guess a statement right or wrong, the number guessed 
right will be approximately equal to the number guessed 
wrong; that is, 3; is equal to 2. Then in the formula, score 
equals x plus 3; minus 2, y and 2 will cancel each other, 
leaving the score equal to x, or the number known, as it 
should be. This demonstration, however, is based on the 
assumption that the number guessed correctly and incor- 
rectly will be about the same, which is true only in case 
a considerable number of statements can be guessed at, 
just as it will be found that a coin when tossed will show 
the same number of heads and tails only when a large 
number of tosses are made. For these reasons, a true- 
false test ought to contain a hundred or more statements 
if its results are to be used as the basis of important 
school marks or grades. These limitations are not found 
in the best-reasons and multiple answer types of examina- 
tion (when some four or five answers for each question 
are shown), and it is therefore such types that are par- 
ticularly recommended to the teacher. 

Another disadvantage of these examinations is that 
they do not test the pupil's power to organi2e his thought 
This drawback can be dealt with, however, by supplement- 
ing the objective examination with short essay papers 
written at home. These probably give a better measure 
of the pupil's power to organize, and better practice in 
organizing, than do the answers hastily put together in 
the stress and hurry of an examination hour. The use 
of such a plan will involve the teacher again in reading 

69 



STANDARDIZED TESTS 

student papers, but it will be a type of reading not half 
so tiresome. For the home essays can be written on 
topics chosen from a considerable number of alternatives 
and can be prepared with more care, and they will there- 
fore furnish a variety of reading that is often really very 
interesting. 

Of course, these objective examinations made by the 
teacher herself are less accurate measuring instruments 
than the standardized tests, because they are made up of 
statements which, while not known to be equal in diffi- 
culty, are usually given equal credit. And they are less 
useful diagnostic instruments, because the statements are 
not ordinarily arranged according to the type of informa- 
tion imphed in correct answers. Their scores should not, 
therefore, be considered to be as dependable as those on 
real standardized tests. But we should also remember 
that, in spite of these limitations, these examinations have 
l)een found by actual trial to be at any rate more accurate 
than the traditional examination. In fact, the traditional 
examination suffers from exactly these same limitations, 
as well, it might be added, as from many others. 

In giving such objective examinations, it is best to 
have the statements multigraphed. A card having on it 
the pupil's name and a key number can be given to each 
member of the class, and the number instead of the name 
can be placed on the paper. When the papers are re- 
distributed to be scored as the teacher reads the answers, 
this anonymous character of the papers promotes honesty 
in the scoring. A further check can be secured by having 
each pupil write on each paper he scores : ''Scored by 

," signing his name. By devices of this kind, 

pupil-scoring can be made reliable, and the time the teacher 
ordinarily spends in reading test papers can be saved for 
something more profitable. 

If it is not practicable to have the examination multi- 
graphed, the statements may be written on the blackboard 

70 



TESTS OF AMOUNT LEARNED 

in advance, or they may be read aloud by the teacher. In 
such a plan the pupil may take down only the number of 
the statement, and indicate his answer by writing ''true" 
or ''false" beside it, or in the case of the best-reasons test 
by writing "second reason" or the like, or in the case 
of the multiple-answer test by writing the single word 
needed for the answer. 

That such examinations actually give dependable re- 
sults has been proved by comparing their scores with other 
records of scholarship, such as grades on traditional exam- 
inations in the same class, or the average grades at the 
end of the previous semester. By the latter criterion, the 
new objective examination stands higher than the tradi- 
tional examination. Successive ratings of the same class 
by an objective examination have also been found to agree 
better with each other and with home-essay ratings than 
the traditional examinations do with each other or with 
home essays.* 

Fully standardized tests in sufficient numbers to cover 
all parts of the curriculum are probably a long way in 
the future. In the meantime the teacher can give a very 
much fairer rating to the work of her pupils, can establish 
better relations with them, and can relieve both herself 
and her pupils of a great deal of drudgery, by using some 
form of the "home-made" objective examinations here de- 
scribed. The only hindrance to putting such a plan into 
general effect is the time it takes a teacher to collect and 
phrase clearly and briefly the large number of statements 
which it requires. With the older children, however, such 
formulations can be made out by members of the class 
as home assignments. This scheme is an excellent one for 
giving practice in discovering the most important points in 
a chapter, and for well-motivated practice, in a "real situ- 

* For a statistical presentation of this evidence, see Gates, The True-False 
Test as a Measure of Achievement in College Courses, Journal of Educational 
Psychology, May, 1921. Or, Wood, Measurement of a College Work, Educa- 
tional Administration and Supervision, September, 1921. 

71 



STANDARDIZED TESTS 

ation," in effective use of terse and unambiguous English. 
The true- false and best-reasons and multiple-answer 
examinations should be thought of as the intermediary be- 
tween the old-fashioned examinations whose ineffectiveness 
was discussed in Chapter I, and the tests which can really 
be called standardized tests. The objective exramination 
made by the teacher will serve to bridge the gap from the 
one to the other. 

What to Do 

The reader who wishes to become familiar with the 
standardized tests which have so far been worked out for 
her own subject or grade should now write to the principal 
publishing houses, named on page 37, and from their price 
Hsts should order samples of the tests which seem best 
fitted to her needs. She should aiso secure the Bibli- 
ography of Tests mentioned on pagfe 2>6, and, when pos- 
sible, read the cited reference to a more complete account 
of the test by its author. By means of these, and by 
means of the discussions of the tests in some of the longer 
textbooks cited at the end of this chapter, she will be able 
to decide which of the tests she wishes to secure. 

Selecting an Achievement Test 
In making a selection among the numerous tests and 
scales which are now available, the teacher should keep 
in mind the exact purpose of the experiment in hand. 
If the purpose is to compare a room or school with aver- 
age attainment, then she should choose a test which has 
been widely used and which bases its standards upon a 
large number of cases. If the purpose is to measure pupil 
progress over a series of weeks or months, then she should 
choose a test which has a considerable number of "forms" 
(duplicate editions), in order that the score at the time 
of the second testing may not be influenced by the pupils' 
remembering any of the previous answers. If the purpose 
is diagnosis, then she should choose a test which is finely 

72 



TESTS OF AMOUNT LEARNED 

divided and measures the various related abilities sep- 
arately. 

The teacher should also make sure, of course, that the 
selected test is adapted to the age of the children she is 
teaching. The time used in administering the test should 
be considered, in order that it may not exceed the length 
of the class period, unless one wishes to extend the period 
or to break up the test into parts. 

In case of a scale, such as the handwriting scales, one 
should also consider the time necessary for learning to use 
the scale accurately. Not the least important consideration 
is money cost, and this may sometimes be kept down by 
selecting a test which is designed for a particular grade 
rather than for a series of grades. Having an uncopy- 
righted test multigraphed is not advisable if one is to 
compare her class with standards, since the change in the 
appearance and the arrangement of the problems will affect 
the scores. Almost no outlay of money will be necessary 
for measurements by such scales as those in handwriting 
or English composition, for these require only one or 
two copies of the scale for the entire room. But such 
scales are probably the most expensive of all from the 
point of view of time-cost. 

Instructions for giving and scoring the tests are usually 
enclosed in the package of test blanks when it is sent 
out by the publisher. These instructions, as has been said 
before, should be followed to the letter. Methods for 
working up the scores, in ways that will show what they 
reveal about a class, are explained in the next chapter. 

The teacher who wishes a fuller discussion and descrip- 
tion of achievement tests than is possible in so brief a 
sketch as this, should consult some of the following text- 
books, which are arranged according to date of publication: 

Monroe, DeVoss, and Kelly: Educational Tests and 
Measurements. 1917. (Houghton Mifflin Company.) De- 
scribes a considerable number of the tests in use in each 

73 



STANDARDIZED TESTS 

of the principal school studies in 1917, and briefly explains 
some practical statistical methods for treating results. 

Monroe: Measuring the Results of Teaching. 1918. 
(Houghton Mifflin Company.) Gives more emphasis than 
its predecessor to the remedial instruction which should 
follow application of the tests. 

Wilson and Hoke: How to Measure. 1920. (Mac- 
millan Company.) An elementary handbook for those 
giving tests. 

McCall: How to Measure in Education. 1922. (Mac- 
millan Company.) An advanced textbook, primarily of 
interest to school supervisors and administrators. Explains 
how to use educational tests, how to construct and stand- 
ardize them, and how to use statistical methods. Technical 
and difficult. 

Pressey: Introduction to the Use of Standard Tests. 
1922. (World Book Company.) An elementary outline 
presenting in simple language much practical information. 



74 



CHAPTER V 

PUTTING MEANING INTO SCORES 

Tables 

The large mass of scores resulting from a test series 
have very little meaning for the teacher just learning to 
use measurements. She is not able at first to see what 
they reveal about her class. She can interpret her results 
only by learning something of those ingenious methods 
for handling mass data which were developed for this 
purpose in economics and biology, and which have more 
recently proved so influential in psychology. When once 
understood, the computations involved are really very 
simple. 

The first thing to do in finding the meaning of a set 
of scores is to put the scores in order. The test papers 
may be sorted so as to have the best paper at the top of 
the stack, the next best paper just below it, and so on 
down. It will then be desirable to make a record of the 
scores, in this way: 

Table I 

Pupil's Name Score 

Henry Jones 27 

Mary Brown 25 

Thomas Smith 24 

Charles Green 24 

Etc. Etc. 

By a glance at such a completed table one can see the 
general distribution of the scores — what the highest and 
lowest scores are, and what are the scores in the middle 
of the table, which the average child of the class is able 
to make. It will usually happen that a number of pupils 

75 



STANDARDIZED TESTS 

will earn the same score. A shorter and more convenient 
table can then be made in this way: 

Table II m"?^^*" ^x! ^q"^''* 

Making the Score 
Score (Frequency) 

58 1 

57 2 

56 2 

55 3 

54 5 

53 8 

52 5 

51 4 

50 4 

49 ' 3 

48 1 

47 1 

This is called a Frequency Table. It shows the fre- 
quency with which each score occurs. 

When the scores spread over a very wide range and 
there are a considerable number of them, it is convenient 
to arrange them in groups, in some such way as this: 
Table III 

Score Frequency 

140-149 2 

130-139 4 

120-129 7 

110-119 6 

100-109 7 

90-99 10 

80-89 12 

70-79 11 

60-69 8 

50-59 6 

40-49 4 

30-39 2 

20-29 1 

76 



PUTTING MEANING INTO SCORES 

This table says that there were 2 pupils who made 
scores between 140 and 149 inclusive, that there were 4 
pupils who made scores between 130 and 139 inclusive, 
and so on. A very large number of scores can be recorded 
in a compact and intelligible form by this plan. 

Of course if the range in the scores had not been so 
wide as here indicated, a smaller interval could have been 
used for the score column, thus: 

Score 

115-119 

110-114 

105-109 

100-104 
etc. 
Here we are using intervals of five instead of intervals 
of ten. In the case of the set of scores represented by 
Table III, intervals of five would have made the table too 
long to be convenient. The table would have contained 
twenty-six divisions instead of thirteen, — an unnecessary 
fineness of distribution for most purposes. A good work- 
ing rule is to so choose the intervals of the score column 
that the table will have some ten or twelve divisions. This 
can be done by subtracting the lowest from the highest 
score, pointing off one place (dividing by ten), and taking 
the nearest convenient unit. For example, if 22 and 146 
are the low and high scores in the group represented in 
Table III, then 146 — 22 = 124, and pointing off one place 
gives 12.4, from which the interval may be taken as 12, 
or as the nearest round number, 10. 

The Median 

In order to compare one class with another, or with 
a standard, we need some one number to stand for the 
class as a whole, to represent the 'central tendency' of 
the class. Our first thought would be to compute the 'av- 
erage' of the scores. But there is a measure of the central 

77 



STANDARDIZED TESTS 

tendency of the class which is so easily found that it is 
in universal use among makers of standardized tests, and 
must therefore be used instead of the 'average' by those 
who would compare their results with standards. This 
measure is the median. The median may be thought of 
simply as the middle score — the score of such a kind that 
half the scores are better and half are not so good. It 
is a measure easy to understand and easy to calculate. It 
is for many purposes a more reliable measure of central 
tendency than the 'average', because it is not markedly 
affected, as is the average, by a few extreme cases. 

To find the median, one counts in from either end of 
the ordered distribution of scores until he finds the middle 
score. If one has sorted the test papers according to size 
of score, then the median may be taken as the score on 
the middle paper. (If there are an even number of 
papers, the score on the two middle papers may be aver- 
aged together for the median). If one has written down 
the scores as shown in Tables II and III, he can count in 
by the same method from one end of the table. For ex- 
ample, in Table II there are 39 scores, and the middle 
paper is therefore the 20th, for there are 19 papers better 
and 19 not so good. The 20th paper is one of the 8 
which received scores of 53. The approximate median 
score is therefore 53. 

In more precise work it may be desirable to take the 
median, not at any particular score, but as the point on the 
score-line above and below which there is an equal num- 
ber of measures. We can then express the median to the 
nearest tenth or hundredth of an integer. This may be 
done, in the example taken, as follows: Counting in from 
the bottom of the table, as before, we have 1 -|- 1 + 3 + 
4 + 4 + 5 = 18. 39 ~ 2 = 19.5. We now take the me- 
dian score as that numbered 19.5, or exactly half the 
number of scores. As we wish to find the theoretical score 
No. 19.5, and as in our counting we have passed 18 scores, 

78 



PUTTING MEANING INTO SCORES 

it is plain that we must go 1.5 scores farther 
(19.5 — 18=1.5). We may assume that the 8 papers 
scored 53 would, if they were more accurately scored, in- 
clude scores from 53.0 up to 53.9, and that they may be 
thought of as spread evenly over this interval. The median 
paper, for which we are looking, would then be -g^, or 
i%o of the distance across the interval. The distance 
across the interval from 53 to 54 is one unit, and i%o 
of 1 is .18, or approximately .2. This is the amount by 
which the median score exceeds 53. Adding it to 53 gives 
53.2, the true median. 

In calculating the median in an arrangement of scores 
such as that of Table III, the scheme of "interpolation" 
just explained is always necessary, and is here perhaps 
easier to understand. In Table III there are 80 scores 
represented (sum of frequency column). —=40. 
l_|_2 + 4 + 6 + 8+ll =32. 40 — 32 = 8. 8/12 of 
10 = 6.7. 80 + 6.7 = 86.7, the median. It is here seen 
that the 12 scores which include the one sought, the 40th, 
are spread between 80 and 89. If they may be assumed 
to be spread evenly across this interval, then the 40th is 
eight-twelfths of the distance across. The distance across 
the interval is 10 units. Adding eight-twelfths of 10, or 
6.7 to the beginning of the interval, or 80, will then locate 
the median. In making this interpolation, one should be 
especially careful that the correction (here 6.7) is added 
to the right number. Many more mistakes are made in 
this step than in finding the correction. It is a good plan 
to check the work by counting from the other end of the 
table. In Table III this would be done as follows: 
2 + 4 + 7 + 6 + 7 + 10 = 36. 40 — 36 = 4. ^2 of 
10 = Z.Z. 90 — 3.3 = 86.7, the same result. In counting 
down in this way from the large scores to the small ones, 
one must remember that the correction (3.3) is in this 
case to be subtracted from the upper end of the interval, 

79 



STANDARDIZED TESTS 

not added. The upper end of the interval 80-89 is taken 
as 90 because 89 is only an abbreviation for 89.99999 -|-, 
which goes as close to 90 as you please. 

It should be pointed out that in a list of scores such as 
that of Table II the score 53 may mean either 53.0 to 53.9, 
or 52.1 to 53.0, or 52.5 to 53.4. That is, all papers may 
be called 53 if the exact score lies between 53 and 54, or 
if it Hes between 52 and 53, or if their score is somewhere 
in the interval whose middle is 53. To make sure which 
of these three things the score 53 means, the tester should 
consult the account of the scoring plan which appears in 
the teacher's manual sent with the test. In practice, how- 
ever, the score 53 may be assumed to mean 53 to 53.9 for 
all common scales except those of handwriting and English 
composition. 

The calculation of the median should be thoroughly 
mastered. It is fundamental for all uses of standardized 
tests for comparing one group with another, or with their 
own past performance. The reader will do well to make up 
and work out examples similar to those shown here until 
the described checking plan shows him that his computa- 
tions are reliable. 
Summary of Steps in the Calculation of the Exact Median 

1. Make a frequency table, collecting the scores if 
necessary into about ten or fifteen groups. 

2. Find half the number of scores. Call this^. 

3. Count in on the frequency column, from the 
smaller scores toward the larger, until adding another 
number would make the sum larger than -|-. 

4. Subtract the sum so found from-y-. 

5. Divide the remainder by the number of scores in 
the next group on the frequency line. 

6. Multiply the result by the number of units in the 
interval on the score line. (If the scores are not grouped, 
the interval on the score line is 1). 

80 



PUTTING MEANING INTO SCORES 

7. Add this number tx) the beginning of the interval 
opposite the number used in step 5. 

8. This is the exact median. 

9. Check by counting in from the opposite end of the 
table, in this case subtracting the correction (found in 
step 6) from the upper end of the middle interval instead 
of adding it to the lower end. 

The Quartile Deviation 

Besides a measure of the "central tendency" of his 
class, the tester will often want a good measure of their 
variability, or range of score. The total range, found by 
subtracting the lowest score from the highest one, is not 
very informing, because it is too much affected by one or 
two extreme scores. A single very bright or very dull 
pupil may give one an entirely wrong idea of the amount 
of variability characteristic of his class. It would be better 
if a few of the extreme scores could be disregarded. The 
common practice in this matter is to disregard the best 
quarter and the poorest quarter of the scores. Subtract- 
ing the lowest from the highest of the scores which then 
remain will give a measurement of the amount of varia- 
bility in the middle half of the class. Dividing this amount 
by two will then show how far the most extreme pupils 
in the middle half of the class deviate from the median. 
This could be worked out directly from the stack of test 
papers arranged according to size of score, as was ex- 
plained for the median. It could also be worked out from 
tabular arrangements of scores, such as those of Table II 
and Table III. The calculation for Table III would be 
as follows : 80 -- 4 =r 20. 1 + 2 + 4 + 6 = 13. 20 — 13 =: 
7. % of 10 = 8.7. 60 + 8.7 = 68.7, the lowest quartile 
point. Similarly, counting in one-fourth of the way from 
the upper end of the scale we have 2-f-4-f-7-f-6^ 19. 
20—19=1. 1/7 of 10=1.4. 110—1.4=108.6, the 
upper quartile point. 108.6 — 68.7 = 39.9, the range of 

81 



STANDARDIZED TESTS 

the middle half of the class. 39.9 -^2 — 19.9, the quartile 
deviation, or the amount that the most extreme persons in 
the middle half of the class deviate from the median. 

The Mean Deviation 

A slightly different measure of variability, the mean 
or average deviation, is sometimes preferred because it 
takes into account all the pupils of the class. It may be 
calculated simply by subtracting from the median each of 
the scores smaller than the median, and subtracting the 
median from each of the scores larger than itself. This, 
of course, shows how far each pupil deviates from the 
median. Adding these deviations and dividing by the 
number of scores then gives the average or mean devia- 
tion. 

The Standard Deviation 

Still another measure of variability, the standard devia- 
tion, is computed in a similar way, except that the de- 
viations are each squared. The sum of the squares is 
found and divided by the number of scores, and the square 
root taken of the result. This is the measure of variability 
most commonly used in statistical work which is carried 
out on a large scale. It is the expression for deviation 
which will ordinarily be found in printed accounts of 
experiments with educational measurements. 

The Measurement of Relationship 
It is often desirable to measure the closeness of rela- 
tionship between two sets of scores. For example, one 
may wish to know how closely two intelligence tests agree 
with each other in their ratings of a certain group of chil- 
dren, or how closely arithmetic and reading scores agree 
with each other (whether pupils scoring high in arithmetic 
are likely to score high in reading), or how closely ability 
in oral reading corresponds with ability in silent reading. 
The amount of relationship present can be shown either 
graphically or numerically. 

82 



PUTTING MEANING INTO SCORES 

Graphic Methods. One of the commonest graphic de- 
vices is the scatter diagram. This is made by laying off 
at equal distances along one side of a piece of cross- 
section paper the scores in one test, and laying off at right 
angles the scores in the other test. A dot may then be 
made on the cross-section paper where the lines represent- 
ing the two scores of a given pupil intersect. Figure I 



UJ 

1- 

u 

z 

UJ 

-J 

UJ 

1- 
z 

>- 

(/) 
q: 
u 

> 

z 
z 

^ 

o 
a: 
ffl 

z 
o 

tn 
u 

O 




































































































































































































































• 






























































































































• 




• 


















i 
















































































































• 






































• 
























































































( 


> 




































• 


• 










































• 


• 












































• 














































































1 


K 
























































































• < 












































































































































































































• 














































































• 


























































































> 












































• 










































• 










































• 




























































































































































• 






























































































































15 50 55 60 65 70 75 80 85 90 95 100105110115120125130135140145! 




SCORES ON THURSTONE INTELLIGENCE TEST | 



FIGURE I. 

83 



STANDARDIZED TESTS 

shows the general plan. A certain pupil had a score of 
71 on the Thurstone test and a score of 58.8 on the Brown 
University test. Find the dot which represents him. 
Another pupil had a score of 107 on the Thurstone test 
and a score of 67.1 on the Brown University test. Find 
his location on the diagram. 

If, as shown on this figure, the scores laid out 
horizontally are placed on the diagram so as to increase 
from left to right, and those laid out vertically are ar- 
ranged to increase from the bottom of the sheet toward 
the top, then pupils who get high scores on both tests 
will be located in the upper right-hand corner of the figure, 
those who get low scores on both tests in the lower left- 
hand corner, and those standing medium in both tests will 
fall near the center of the figure. This means that, if the 
two sets of scores are plotted so as to be equally spaced 
along the two sides of the paper, the closeness of the 
relationship between them can be seen from the closeness 
with which the dots are grouped about a line running 
from the lower left-hand corner to the upper right-hand 
corner of the figure. This is what we wish to read from 
the figure. 





FIGURE III. 
85 



STANDARDIZED TESTS 

Another simple graphic device is made by arranging 
the pupils' names in order according to their standing on 
one test, and then showing opposite each name the rank 
of the pupil in each test. If lines are drawn from No. 1 
in one column to No. 1 in the other, from No. 2 in one 
column to No. 2 in the other, and so on, as shown in 
Figure II, the amount of crossing of the lines will show 
the amount of disagreement between the two tests, and 
the steepness of certain lines will show the extreme de- 
grees of disagreement, or the amount certain pupils change 
in rank from one test to the other. 

Another simple graph can be constructed by plotting 
the ranks of one over the other, as in Figure III. Here the 
pupils have been arranged in order according to their scores 
on the Thurstone intelligence test, and the ranks plotted as 
the straight line. (The names or key-numbers of the pupils 
are written along the bottom of the sheet). The ranks 
in the Brown University intelligence test are plotted to the 
same base for comparison. If all the pupils should rank 
exactly the same in both tests, the two lines would co- 
incide. The amount of disagreement between the two tests 
is therefore shown by the amount in which the lines 
diverge from each other. 

A graph of a kind commonly seen in newspapers and 
miagazines is shown in Figure IV. Different kinds of 
shading are used to represent different degrees of the 
quality represented. For example, the bars might repre- 
sent I. Q.'s of a class according to each of the two intelli- 
gence tests : the heavily hatched portion of each bar might 
represent I. Q.'s between 70 and 79, the next lighter por- 




FIGURE IV. 
86 



PUTTING MEANING INTO SCORES 

tion I. Q.'s between 80 and 89, and so on. Whether the 
two intelHgence tests agree in their distribution of various 
degrees of abiUty among the members of a class could 
then be seen at a glance. 

Still another common method of representing relation- 
ship is the use of parallel bars whose total length depends 
on the amount of the thing represented. Two bars may 
be made for each pupil, one representing, say, his rank 
in class according to one test, and the other his rank 
according to the other test. These bars may be differently 
shaded or differently colored. The degree in which the 
various pairs of bars agree in length would then show 
the amount of agreement between the two tests. It is best 
to arrange the names of the pupils (written below the 
bars) in order according to one measure or the other. 
This makes the figure considerably easier to read. 

Graphic devices give a striking representation of results, 
but not one which easily allows precise statements as to 
the amount of relationship discovered. One cannot always 
say, from an examination of graphs, whether the agreement 
between a certain pair of tests is greater or less than the 
agreement between another pair; and he can never say 
how much greater or less it is. For such information we 
must rely upon numerical rather than graphic treatment 
of the data. 

Numerical Methods : Frequency Table. The numerical 
methods of indicating the degree of relationship between 
two sets of scores indicate the facts very definitely. One 
of the best is the frequency table in terms of changes of 
rank. If all the pupils are ranked according to one test 
and then according to the other, and the differences in 
rank found, a frequency table can be made of the differ- 
ences. The method is shown in Table IV. In Part II of 
this table it is seen that one pupil kept exactly the same 
rank in both tests, three pupils changed rank one place, five 

87 



STANDARDIZED TESTS 



pupils changed rank two places, and so on. Information 
of this kind is often of very high practical value. 

Table IV, Part I 

Rank in Rank in Difference 
Student Test I Test II in Rank 

A. C 1 3 2 

B. D 2 1 1 

E. J 3 4 1 

R. L 4 2 2 

S. M 5 9 4 

E. F 6 6 

R. S 7 16 9 

A. V 8 5 3 

S. Z 9 7 2 

L. M 10 18 8 

N. 11 10 1 

P. Q 12 17 5 

N. M 13 8 5 

N. L 14 11 3 

V. W 15 12 3 

X. Z 16 14 2 

S. P 17 15 2 

S. D 18 13 5 

Table IV, Part II 

Difference 
in Rank Frequency 

1 

1 3 

2 5 

3 3 

4 1 

5 3 

6 

7 

8 1 

88 



PUTTING MEANING INTO SCORES 

Sum of Differences in Rank. Another good method 
of measuring relationship between two sets of scores is 
to find the sum of the differences in the ranks given the 
pupils by the two tests. The sum of the right-hand 
column in Table IV, Part I, can be compared either with 
the sum of the differences in rank if the differences were 
made as large as possible, or with the sum in case the 
rankings were left to chance. The sum of the differences 
in rank when rank is left to chance is " ~ , where n is 
the number of pupils. The sum of the maximum dif- 
ferences in rank can be found very quickly, even for a 
large group, if one observes that the column of differences 
in this problem takes the form of two ordered series, 
1, 3, 5, 7, 9, etc. (see Table V), and that they can there- 
fore be added by the formulae, l=a+(n — l)d, and 
s = (a -j- /)-y-, in which a is the first term of the series, 
/ the last term, n the number of terms, d the common 
difference between the terms, and s the sum of the terms. 
Comparing the sum of the actual differences in rank with 
the sum of the maximum dift'erences, and with the sum 
of the differences when rankings are left to chance, gives 
a fairly good idea of the amount of agreement between 
two tests. .^^^j^ ^ 

Rank in Rank in Differences 

Test I Test II in Rank 

1 10 9 

2 9 7 

3 8 5 

4 7 3 

5 6 1 

6 5 1 

7 4 3 

8 3 5 

9 2 7 

10 1 9 

89 



STANDARDIZED TESTS 

/ = a+ (n—l)d=rl +(5 — 1)2 = 9 

s=(a + /)-f-=(l+9)%=25 

5 = 25 X 2 == 50 
The Coefficient of Correlation. The measure for rela- 
tionship which is most commonly used is the coefficient of 
correlation. This is a much more compact way of repre- 
senting results, as the whole story is told by a single 
number, or at most by a pair of numbers. The devices 
described above express the relationship either by means 
of space-filling tables or by some rather long phrase, such 
as ''the average change of rank is 12 out of a possible 38," 
or "the sum of the differences in rank is 1,283, as com- 
pared with a random-change sum of 1,822, and a maxi- 
mum-change sum of 2,346." The coefficient of correlation 
says all this by one number, for example, .72. The for- 
mula leading to this convenient result takes several forms, 
but the one best meeting the needs of a teacher who is 
working with a single class may be expressed, with a slight 
change from its traditional statement, thus : r = 1 — ^^ ^f ^' 
where r stands for the coefficient of correlation, ^ for 
"the sum of," D for the differences in rank, and N for 
the number of cases (ordinarily the number of pupils who 
took both tests). An illustration will make the use of the 
formula clear: 

Table VI 





Rank in 


Rank 


in 


Difference 


Difference 


Student 


Test I 


Test 


II 


in Rank 


Squared 


A. C. . 


1 


3 




2 


4 


B. D. . 


2 


1 




1 


1 


E. J. . 


3 


4 




1 


1 


R. L. . 


4 


2 




2 


4 


S. M. . 


5 


9 




4 


16 


E. F. . 


6 


6 










R. S. . 


7 


16 




9 


81 


A. V. . 


8 


5 




3 


9 



90 



PUTTING MEANING INTO SCORES 



Student 

s. z. .. 




Rank in 
Test I 

. 9 


Rank 
Test 

7 


in 
II 


Difference 
in Rank 

2 


Difference 
Squared 

4 


L. 


M. . 




.10 


18 






8 




64 


N. 


O. .. 




.11 


10 






1 




1 


P. 


Q. •■ 




.12 


17 






5 




25 


N. 


M. . 




.13 


8 






5 




25 


N. 


L. .. 




.14 


11 






3 




9 


V. 


W. . 




.15 


12 






3 




9 


X. 


Z. .. 




.16 


14 






2 




4. 


s. 


P. .. 




.17 


15 






2 




4 


s. 


D. .. 




.18 


13 






5 




25 


R. 


O. .. 




.19 


20 






1 




1 


L. 


K. .. 




.20 


19 






1 




1 






288 




r - 1 ^ ^ ^' 

N(N* — 1) 


= 1- 


6X288 
20(202—1) " 


= .78 






PR 


^ 


6745 (l-r2) 


.6745(1 


-.782) 





.06 





VN 



V20 



Summary of Steps in Rank-Difference Method of 
Calculating the Coefficient of Correlation 

1. Find the rank of each pupil in each of the two 
tests, and place these numbers opposite his name, as shown 
in Table VI. 

2. Subtract each pupil's rank in one test from his 
rank in the other, obtaining column headed "Difference in 
Rank." 

3. Square each of these differences, obtaining column 
headed "Difference Squared." 

4. Add this column of squares, obtaining % 

5. Multiply this sum by 6. 

6. Square the number of pupils, subtract 1, and multi- 
ply the result by the number of pupils. 

7. Divide the result of step 5 by the result of step 6. 

8. Subtract the quotient from 1. 

9. This is the coefficient of correlation. 

91 



STANDARDIZED TESTS 

Accuracy of the Coefficient. The accuracy of a corre- 
lation coefficient is proportional to the number of scores 
involved and to the closeness of the agreement between 
them. The formula for the "probable error" of a coeffi- 



6745 



IS 



cient of correlation is, PE ='-^-^ (1 — r^), where N i 
the number of pupils. A coefficient of correlation of .78 
with a probable error of .06 would mean that the coeffi- 
cient may be taken to lie between .72 and .84. The prob- 
able error increases rapidly as the number of cases falls 
off, and coefficients of correlation should not be calculated 
for classes smaller than twenty. 

Meaning of the Size of the Coefficient. A perfect 
agreement between two sets of scores — the same rank 
given each pupil by both tests — would give a coefficient 
of correlation of -|-1.00; perfect reversal of ranking would 
give — 1.00; and the absence of any significant relation- 
ship between the two rankings would give 0. The signifi- 
cance of any obtained coefficient therefore depends upon 
how closely it approaches 1. In practice it is found that 
a coefficient of correlation less than .20 or .30 indicates 
an agreement so slight as to be insignificant; that a co- 
efficient between .20 or .30 and .50 or .60 indicates a 
significant but not a very close agreement; that a coeffi- 
cient between .50 or .60 and .70 or .80 indicates a close 
agreement; and that a coefficient higher than this indicates 
an agreement that is very close, indeed. The exact meaning 
of such a result, however, depends upon the kind of data 
from which it was derived, and the teacher should be 
very careful in interpreting coefficients of correlation. For 
example, small consistent changes of rank throughout the 
class may make the coefficient of correlation quite low, 
while as a matter of fact the agreement between the two 
sets of scores may be close enough for the purpose the 
teacher has in mind. This would be true if the purpose 
was to classify pupils into groups of uniform ability, for 
many small changes of rank would not here be significant, 

92 



PUTTING MEANING INTO SCORES 



since they would not cause pupils to move far up or down 
the scale and thus different tests would not place them 
in different sections. 

The Correlation Table. Perhaps for the classroom 
teacher the most useful measure of relationship, in addi- 
tion to the frequency table described on pages 87 and 88, 
is the correlation table. This is very similar in construc- 
tion to the scatter diagram. An example will make it plain 
(Table VII) : 

Table VII 



Scores in 
Otis Test 


Scores 


in 


Illinois General 
Examination 


Intellig 


ence 






Below 


68 


68-81 


82-95 


96-109 


1104 


Totals 


135+ 






1 


2 


4 


3 


10 


115-134 






4 


9 


8 


7 


28 


95-114 


2 




11 


21 


4 


3 


41 


75-94 


7 




15 


10 


2 


. . 


34 


Below 75 


3 




3 


1 






7 


Totals 


12 




34 


43 


18 


13 


120 



This table reads as follows: One pupil received a score 
above 135 on the Otis test, but a score between 68 and 81 
on the Illinois Examination — was placed in the upper fifth 
or quintile of his class by one test and in the fourth 
quintile by the other; two pupils received a score above 
135 on the Otis test, but a score between 82 and 95 on 
the Illinois Examination — were placed in the upper quin- 
tile by one test and in the third quintile by the other; and 
so on. The pupils about whom the tests agree lie on the 
diagonal running from lower-left to upper-right. Adding 
up these numbers, we have 3 + 15-|-21-|-8 + 3 = 50, 



STANDARDIZED TESTS 

which is 42% of the total number, 120. The pupils about 
whom there is substantial but not complete agreement will 
lie adjacent to the diagonal. Adding up these numbers, 
we have 7 + 11+9 + 4 + 7 + 4+10 + 3 = 55, which 
is 46% of 120. Combining 42% and 46% we have 88% 
as the portion of the class in which the two tests are in 
at least substantial agreement. This is just the kind of 
information the teacher needs in practical work with in- 
telligence measurements. 

Another use for such a table is to show whether the 
data are of such a kind as to make the calculation of the 
:oefficient of correlation worth while. If the larger num- 
ber of the scores lie in an approximately straight line, or 
if a line drawn through the centers of each column so 
as to have an equal number of scores above and below it 
is straight, then the coefficient of correlation is applicable; 
but if such a line is curved, the coefficient of correlation 
is not applicable. It is therefore well to make such a table 
even if the correlation coefficient is to be computed by 
the method shown above, which does not directly utilize 
the table. 

In making a correlation table one should make the 
spaced intervals (in this table, 68-81, etc.) equal to each 
other in each set of scores (though not necessarily in both 
sets), and should so select them that the line of totals 
at the bottom and sides of the table will at least roughly 
correspond to the "normal" distribution, that is, will take 
something like these percentage distributions: 7%, 24%, 
38%, 24%, 7%. This can be done by counting in from 
either end of the ordered distribution of scores (as in 
Table II or Table III), until one has passed about 7% 
of the cases. The two scores thus located can then be 
subtracted, and the difference divided by three (if quintiles 
are to be used) ; this will give the approximate size of 
the score interval. When very large numbers have been 
tested, a finer distribution can be made, if desired, by 

94 



\ 



PUTTING MEANING INTO SCORES 

dividing the scale into sevenths instead of fifths. In this 
case the result of the subtraction, just described, will be 
divided by five to get the size of the score interval. 

Summary. The three types of measures which the 
teacher will need are: measures of central tendency, meas- 
ures of variability, and measures of relationship. The 
best measures of central tendency are the average (techni- 
cally called the mean), and the median, and of these two 
the median is almost universally used in work with stand- 
ardized tests. The best measures of variability are the 
quartile deviation, the average or mean deviation, and 
the standard deviation. The best measures of relationship, 
in addition to various types of graphs, are the tables of 
frequency of difference in rank, the sum of the differences 
in rank, the coefficient of correlation, and the correlation 
table. The coefficient of correlation is the measure of 
relationship commonly used in statistical work, but the 
frequency table and the correlation table will probably be 
more useful in work done with standardized tests by the 
classroom teacher. 



95 



INDEX 



Achievement tests — 

definition of, 15 

kinds of, 13, 15 

history of, 42, 43 

objections to, 44 

books on, IZ 
Army tests, 20, 25, Z1 
Bar graphs, 86, 87 
Binet, 14, 18, 24 
Central tendency, 11 
Coefficient of correlation, 90-93 
Colleges, intelligence tests for, 

2,1 
Correlation table, 93 
Frequency table, 76; of differ- 
ences in rank, 87 
Grammar grades, intelligence 

tests for, Z1 
High school, intelligence tests 

for, 2n 
Individual intelligence tests, 14, 

18-20, 27 
Individual teaching, 50 
Intelligence tests — 

defiinition of, 13, 17, 18 

kinds of, 13, 14. 23 

history of, 18-20 

contents of, 20, 21 

lists of, 36, Z1 

uses of, 5, 28-34 

books on, 40 
Kindergarten, intelligence tests 

for, 37 
Kinds of tests, 13-16 
Marks or grades, 8, 9, 25 



Mean or average deviation, 82 

Median, 77-81 

Periodicals treating standard- 
ized tests, 35 

Primary grades, intelligence 
tests for, Zl 

Publishers of tests, 37 

Quartile deviation, 81 

Rank graph, 84, 85 

Scatter diagram, 83 

Scientific method, 12, 44 

Slow thinker and timed tests, 
27 

Standard deviation, 82 

Sum of differences in rank, 89 

Teachers' estimates of intel- 
ligence, 24 

Terman, 19, 24 

True-false test, 67 

Uses of standardized tests — 
for comparison, 5 
for evaluating methods of 

teaching, 5 
for diagnosis, 6, 46 
for motivation, 7, 52 
for pupil rating, 8 
for teacher rating, 10 
for learning new class, 11 
for placing transferred pu- 
pils, 11 

Uses of intelligence tests, 28- 

34 
Will-temperament, tests of, 13, 

16 



u ^"^ 't 



^^^^ 


1 


... 


r 


^o>* 


f 


















^.. \. </ '>S^A". ^-^ ♦* '^^^'. ^«. c'?'* 









.^' 





'^^/1??7^^ 



^ - o , o , ^^ . , 










^^ ^* 












"^■^ <^^4mk°- .//^^A //—•"- 









.^^^^ 







LIBRARY OF CONGRESS 



021 337 888 



ir^f^ 






f. •-■ 













' " J . 





mAf 

mm 



